Foreword


“The first comprehensive book covering the full spectrum of a young, fast-growing 
research field, graph neural networks (GNNs), written by authoritative authors!” 


Jiawei Han (Michael Aiken Chair Professor at University of Illinois at Urbana- 
Champaign, ACM Fellow and IEEE Fellow) 


“This book presents a comprehensive and timely survey on graph representation 
learning. Edited and contributed by the best group of experts in this area, this book 
is a must-read for students, researchers and practitioners who want to learn anything
about Graph Neural Networks.” 


Heung-Yeung “Harry” Shum (Former Executive Vice President for Technology
and Research at Microsoft Research, ACM Fellow, IEEE Fellow, FREng) 


“As the new frontier of deep learning, Graph Neural Networks offer great potential 
to combine probabilistic learning and symbolic reasoning, and bridge knowledge- 
driven and data-driven paradigms, nurturing the development of third-generation 
AI. This book provides a comprehensive and insightful introduction to GNN, rang- 
ing from foundations to frontiers, from algorithms to applications. It is a valuable 
resource for any scientist, engineer and student who wants to get into this exciting 
field.” 


Bo Zhang (Member of the Chinese Academy of Sciences, Professor at Tsinghua
University)


“Graph Neural Networks are one of the hottest areas of machine learning and this 
book is a wonderful in-depth resource covering a broad range of topics and applica- 
tions of graph representation learning.” 


Jure Leskovec (Associate Professor at Stanford University, and investigator at 
Chan Zuckerberg Biohub). 


“Graph Neural Networks are an emerging machine learning model that is already 
taking the scientific and industrial world by storm. The time is perfect to get in on the 
action — and this book is a great resource for newcomers and seasoned practitioners 
alike! Its chapters are very carefully written by many of the thought leaders at the
forefront of the area.” 


Petar Veličković (Senior Research Scientist, DeepMind) 


Preface 


The field of graph neural networks (GNNs) has made rapid strides in recent
years. Graph neural networks, also known as deep learning on graphs,
graph representation learning, or geometric deep learning, have become one of the 
fastest-growing research topics in machine learning, especially deep learning. This 
wave of research at the intersection of graph theory and deep learning has also influ- 
enced other fields of science, including recommendation systems, computer vision, 
natural language processing, inductive logic programming, program synthesis, soft- 
ware mining, automated planning, cybersecurity, and intelligent transportation. 

Although graph neural networks have attracted remarkable attention, applying them to
other domains still raises many challenges, from the theoretical understanding of the
methods to scalability and interpretability in real systems, and from the soundness
of the methodology to empirical performance in applications. Moreover, as the field
grows rapidly, it has become extremely challenging to gain a global perspective on
the development of GNNs. We therefore feel an urgent need to bridge this gap with a
comprehensive book on this fast-growing yet challenging topic, one that can benefit a
broad audience including advanced undergraduate and graduate students, postdoctoral
researchers, lecturers, and industrial practitioners.

This book is intended to cover a broad range of topics in graph neural networks, 
from the foundations to the frontiers, and from the methodologies to the applica- 
tions. Our book is dedicated to introducing the fundamental concepts and algorithms 
of GNNs, new research frontiers of GNNs, and broad and emerging applications 
with GNNs. 


Book Website and Resources 


The website and further resources of this book can be found at:
https://graph-neural-networks.github.io/. The website provides online preprints
and lecture slides of all the chapters. It also provides pointers to useful material and 
resources that are publicly available and relevant to graph neural networks. 


To the Instructors


The book can be used for a one-semester graduate course. Though it is mainly written
for students with a background in computer science, students with a basic
understanding of probability, statistics, graph theory, linear algebra, and machine
learning techniques such as deep learning will find it accessible. Chapters that
cover material students already know can be skipped or assigned as review homework.
For example, if students have taken a deep learning course, they can skip Chapter 1.
Instructors can also combine Chapters 1, 2, and 3 into a background introduction at
the very beginning of the course.

When the course focuses more on the foundation and theories of graph neural net- 
works, the instructor can choose to focus more on Chapters 4-8 while using Chapters 
19-27 to showcase the applications, motivations, and limitations. Please refer to the 
Editors’ Notes at the end of each chapter on how Chapters 4-8 and Chapters 19-27 
are correlated. When the course focuses more on the research frontiers, Chapters 
9-18 can be the pivot to organize the course. For example, an instructor can make 
it an advanced graduate course where the students are asked to search for and present
the most recent research papers in each research frontier. They can also
be asked to establish their course projects based on the applications described in 
Chapters 19-27 as well as the materials provided on our website. 


To the Readers 


This book was designed to cover a wide range of topics in the field of graph neural
networks, including background, theoretical foundations, methodologies, research
frontiers, and applications. It can therefore be treated as a comprehensive handbook
for a wide variety of readers such as students, researchers, and professionals. You
should have some knowledge of the concepts and terminology associated with
statistics, machine learning, and graph theory. Some background on the basics is
provided and referenced in the first eight chapters. Knowledge of deep learning and
some programming experience will also help you follow most chapters of this book. In
particular, you should be able to read pseudocode and understand graph structures.

The book is well modularized, and each chapter can be studied on its own according to
individual interests and needs. Readers who want a solid understanding of the various
techniques and theories of graph neural networks can start from Chapters 4-9. Those
who further want to perform in-depth research and advance related fields should read
the chapters of interest among Chapters 9-18, which provide comprehensive knowledge of
the most recent research issues, open problems, and research frontiers. Those who
want to apply graph neural networks to specific domains, or who are looking for
interesting applications to validate specific graph neural network techniques, should
refer to Chapters 19-27.


Terminologies


This chapter defines the terms related to graph neural networks that are used
throughout this book.


1 Basic concepts of Graphs 


Graph: A graph is composed of a node set and an edge set, where nodes represent
entities and edges represent the relationships between entities. The nodes
and edges form the topological structure of the graph. Besides the graph structure,
nodes, edges, and/or the whole graph can be associated with rich information
represented as node/edge/graph features (also known as attributes or contents).

Subgraph: A subgraph is a graph whose node set and edge set are both
subsets of those of the original graph.

Centrality: Centrality is a measure of the importance of nodes in the
graph. The basic assumption of centrality is that a node is important if
many other important nodes connect to it. Common centrality
measures include degree centrality, eigenvector centrality, betweenness
centrality, and closeness centrality.

Neighborhood: The neighborhood of a node generally refers to the other nodes that
are close to it. For example, the k-order neighborhood of a node, also called the
k-step neighborhood, is the set of other nodes whose shortest-path
distance to the central node is no larger than k.

Community Structure: A community refers to a group of nodes that are
densely connected internally and less densely connected externally.

Graph Sampling: Graph sampling is a technique to pick a subset of nodes and/ 
or edges from the original graph. Graph sampling can be applied to train ma- 
chine learning models on large-scale graphs while preventing severe scalability 
issues. 

Heterogeneous Graphs: Graphs are called heterogeneous if the nodes and/or
edges of the graph are of different types. A typical example of heterogeneous
graphs is knowledge graphs, where the edges are of different types.

Hypergraphs: Hypergraphs are generalizations of graphs in which an edge can
join any number of nodes.

Random Graph: Random graph models describe the probability distributions
over graphs from which observed graphs are generated. The most
basic and well-studied random graph model, known as the Erdős-Rényi model,
assumes that the node set is fixed and that each edge is generated independently
with identical probability.

Dynamic Graph: A dynamic graph is a graph in which at least one component of the
graph data changes over time, e.g., nodes are added or deleted, edges are added or
deleted, or edge weights or node attributes change. Graphs that are
not dynamic are referred to as static graphs.
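
As a small illustration of the Neighborhood and Centrality entries above, the following sketch computes a k-step neighborhood by breadth-first search and the degree centrality of each node. The adjacency-list dictionary, the function names, and the toy graph are assumptions made for this example only; degree centrality is normalized here by the maximum possible degree n - 1, which is one common convention.

```python
from collections import deque


def k_hop_neighborhood(adj, source, k):
    """Return the nodes whose shortest-path distance from `source` is between 1 and k."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        if dist[u] == k:
            continue  # do not expand beyond k steps
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return {v for v in dist if v != source}


def degree_centrality(adj):
    """Degree of each node divided by the maximum possible degree (n - 1)."""
    n = len(adj)
    return {u: len(nbrs) / (n - 1) for u, nbrs in adj.items()}


# Toy undirected graph (hypothetical): edges 0-1, 1-2, 2-3, 1-4
adj = {0: [1], 1: [0, 2, 4], 2: [1, 3], 3: [2], 4: [1]}

print(k_hop_neighborhood(adj, 0, 1))  # {1}
print(k_hop_neighborhood(adj, 0, 2))  # {1, 2, 4}
print(degree_centrality(adj)[1])      # 0.75
```

Node 3 is three steps away from node 0, so it is excluded from the 2-step neighborhood; node 1 touches three of the four other nodes, which gives it the highest degree centrality in this toy graph.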


2 Machine Learning on Graphs 


Spectral Graph Theory: Spectral graph theory analyzes matrices associated
with the graph, such as its adjacency matrix or Laplacian matrix, using tools from
linear algebra such as the eigenvalues and eigenvectors of the matrix.

Graph Signal Processing: Graph Signal Processing (GSP) aims to develop
tools for processing signals defined on graphs. A graph signal refers to a finite
collection of data samples with one sample at each node in the graph.

Node-level Tasks: Node-level tasks refer to machine learning tasks associated
with individual nodes in the graph. Typical examples of node-level tasks include
node classification and node regression.

Edge-level Tasks: Edge-level tasks refer to machine learning tasks associated
with a pair of nodes in the graph. A typical example of an edge-level task is
link prediction.

Graph-level Tasks: Graph-level tasks refer to machine learning tasks associ- 
ated with the whole graph. Typical examples of graph-level tasks include graph 
classification and graph property prediction. 

Transductive and Inductive Learning: In transductive learning,
the targeted instances such as nodes or edges are observed at training time
(though the labels of the targeted instances remain unknown), whereas inductive
learning aims to learn a model that generalizes to unobserved instances.
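
To make the Spectral Graph Theory and Graph Signal Processing entries above more concrete, the sketch below builds a Laplacian matrix from an adjacency matrix and evaluates the quadratic form of a graph signal. It assumes the unnormalized Laplacian L = D - A, which is one common definition and is not fixed by the entries above; the function names and the toy graph are likewise hypothetical.

```python
def laplacian(A):
    """Unnormalized graph Laplacian L = D - A for a symmetric 0/1 adjacency matrix."""
    n = len(A)
    deg = [sum(row) for row in A]
    return [[(deg[i] if i == j else 0) - A[i][j] for j in range(n)] for i in range(n)]


def quadratic_form(L, x):
    """x^T L x, which equals the sum of (x_i - x_j)^2 over all edges (i, j)."""
    n = len(x)
    return sum(x[i] * L[i][j] * x[j] for i in range(n) for j in range(n))


# Toy graph (hypothetical): triangle 0-1-2 plus node 3 attached to node 2
A = [
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
]
L = laplacian(A)

smooth = [1.0, 1.0, 1.0, 1.0]    # constant signal: no variation across edges
rough = [1.0, -1.0, 1.0, -1.0]   # signal that flips sign across most edges

print(quadratic_form(L, smooth))  # 0.0
print(quadratic_form(L, rough))   # 12.0
```

A constant signal has zero variation across every edge, so its quadratic form is zero, while the alternating signal differs by 2 across three of the four edges and therefore yields 3 x 4 = 12.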


3 Graph Neural Networks 


Network embedding: The goal of network embedding is to represent each node
in the graph as a low-dimensional vector so that useful information such as the
graph structure and some properties of the graph is preserved in the embedding
vectors. Network embedding is also referred to as graph embedding and node
representation learning.

Graph Neural Network: A graph neural network is any neural network
that operates on graph data.

Graph Convolutional Network: Graph convolutional network usually refers to
a specific graph neural network proposed by Kipf and Welling (2017a). In some
literature, it is occasionally used as a synonym for graph neural network, i.e.,
referring to any neural network that operates on graph data.

Message-Passing: Message-passing is a framework of graph neural networks in
which the key step is to pass messages between different nodes based on graph
structures in each neural network layer. The most widely adopted formulation,
usually denoted as message-passing neural networks, is to only pass messages
between nodes that are directly connected (Gilmer et al, 2017). The message
passing functions are also called graph filters and graph convolutions in some
literature.

Readout: Readout refers to functions that summarize the information of individual
nodes to form higher-level information, such as forming a subgraph/supergraph
or obtaining the representation of the entire graph. Readout is also called
pooling and graph coarsening in some literature.

Graph Adversarial Attack: Graph adversarial attacks aim to generate worst-case
perturbations by manipulating the graph structure and/or node features so
that the performance of some models is degraded. Graph adversarial attacks
can be categorized based on the attacker’s goals, capabilities, and accessible
knowledge.

Robustness certificates: Methods providing formal guarantees that the predic- 
tion of a GNN is not affected even when perturbations are performed based on 
a certain perturbation model. 
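
The Message-Passing and Readout entries above can be sketched in a few lines. The example below is a deliberately simplified illustration, not the formulation of any particular model: each node averages its own feature vector with those of its direct neighbors, with no learnable weights or non-linearities, and the readout is a plain mean over nodes. The function names, the toy graph, and the features are assumptions made for this example.

```python
def message_passing_layer(adj, h):
    """One round of message passing: every node averages its own feature
    vector with the feature vectors of its directly connected neighbors."""
    new_h = {}
    for u, nbrs in adj.items():
        group = [h[u]] + [h[v] for v in nbrs]
        dim = len(h[u])
        new_h[u] = [sum(vec[d] for vec in group) / len(group) for d in range(dim)]
    return new_h


def mean_readout(h):
    """Summarize all node representations into one graph-level vector."""
    dim = len(next(iter(h.values())))
    return [sum(vec[d] for vec in h.values()) / len(h) for d in range(dim)]


# Toy path graph 0-1-2 with two-dimensional node features (hypothetical)
adj = {0: [1], 1: [0, 2], 2: [1]}
h = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0]}

h1 = message_passing_layer(adj, h)   # node representations after one layer
graph_vector = mean_readout(h1)      # representation of the entire graph

print(h1[0])         # [0.5, 0.5]
print(graph_vector)
```

Stacking several such layers lets information travel further: after k layers, the representation of a node depends on its k-step neighborhood.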


Notations 


This chapter provides a concise reference for the notations used throughout this book.


Numbers, Arrays, and Matrices


A scalar                                                     $x$
A vector                                                     $\mathbf{x}$
A matrix                                                     $\mathbf{X}$
An identity matrix                                           $\mathbf{I}$
The set of real numbers                                      $\mathbb{R}$
The set of complex numbers                                   $\mathbb{C}$
The set of integers                                          $\mathbb{Z}$
The set of real n-length vectors                             $\mathbb{R}^{n}$
The set of real m × n matrices                               $\mathbb{R}^{m \times n}$
The real interval including a and b                          $[a, b]$
The real interval including a but excluding b                $[a, b)$
The element of the vector x with index i                     $x_i$
The element of matrix X indexed by row i and column j        $X_{i,j}$


Graph Basics


A graph                                                      $\mathcal{G}$
Edge set                                                     $\mathcal{E}$
Vertex set                                                   $\mathcal{V}$
Adjacency matrix of a graph                                  $\mathbf{A}$
Laplacian matrix                                             $\mathbf{L}$
Diagonal degree matrix                                       $\mathbf{D}$
Isomorphism between graphs G and H                           $\mathcal{G} \cong \mathcal{H}$
H is a subgraph of graph G                                   $\mathcal{H} \subseteq \mathcal{G}$
H is a proper subgraph of graph G                            $\mathcal{H} \subset \mathcal{G}$
Union of graphs H and G                                      $\mathcal{G} \cup \mathcal{H}$
Intersection of graphs H and G                               $\mathcal{G} \cap \mathcal{H}$
Disjoint union of graphs H and G                             $\mathcal{G} + \mathcal{H}$
Cartesian product of graphs H and G                          $\mathcal{G} \times \mathcal{H}$
The join of graphs H and G                                   $\mathcal{G} \vee \mathcal{H}$


Basic Operations


Transpose of matrix X                                        $\mathbf{X}^{\top}$
Dot product of matrices X and Y                              $\mathbf{X} \cdot \mathbf{Y}$
Element-wise (Hadamard) product of matrices X and Y          $\mathbf{X} \odot \mathbf{Y}$
Determinant of X                                             $\det(\mathbf{X})$
p-norm (also called $\ell_p$ norm) of x                      $\|\mathbf{x}\|_p$
Union                                                        $\cup$
Intersection                                                 $\cap$
Subset                                                       $\subseteq$
Proper subset                                                $\subset$
Inner product of vectors x and y                             $\langle \mathbf{x}, \mathbf{y} \rangle$


Functions


The function f with domain A and range B                     $f : A \rightarrow B$
Derivative of y with respect to x                            $\frac{dy}{dx}$
Partial derivative of y with respect to x                    $\frac{\partial y}{\partial x}$
Gradient of y with respect to x                              $\nabla_{\mathbf{x}} y$
Matrix derivatives of y with respect to matrix X             $\nabla_{\mathbf{X}} y$
The Hessian matrix of function f at input vector x           $\nabla^{2} f(\mathbf{x})$
Definite integral over the entire domain of x                $\int f(\mathbf{x})\, d\mathbf{x}$
Definite integral with respect to x over the set S           $\int_{S} f(\mathbf{x})\, d\mathbf{x}$
A function of x parametrized by θ                            $f(\mathbf{x}; \theta)$
Convolution between functions f and g                        $f * g$


Probability Theory


A probability distribution of a                              $p(a)$
A conditional probability distribution of b given a          $p(b \mid a)$
The random variables a and b are independent                 $a \perp b$
Variables a and b are conditionally independent given c      $a \perp b \mid c$
Random variable a has a distribution p                       $a \sim p$
The expectation of f(a) with respect to the variable a
under distribution p                                         $\mathbb{E}_{a \sim p}[f(a)]$
Gaussian distribution over x with mean μ and covariance Σ    $\mathcal{N}(\mathbf{x}; \boldsymbol{\mu}, \boldsymbol{\Sigma})$


Representation Learning


Liang Zhao, Lingfei Wu, Peng Cui and Jian Pei 


Abstract In this chapter, we first describe what representation learning is and why 
we need representation learning. Among the various ways of learning representa- 
tions, this chapter focuses on deep learning methods: those that are formed by the 
composition of multiple non-linear transformations, with the goal of resulting in 
more abstract and ultimately more useful representations. We summarize the repre- 
sentation learning techniques in different domains, focusing on the unique chal- 
lenges and models for different data types including images, natural languages, 
speech signals and networks. Last, we summarize this chapter. 


1.1 Representation Learning: An Introduction 


The effectiveness of machine learning techniques relies heavily not only on the design
of the algorithms themselves but also on a good representation (feature set) of the
data. Ineffective data representations that lack important information, or that
contain incorrect or highly redundant information, can lead to poor performance of the
algorithm across different tasks. The goal of representation learning is to extract
sufficient but minimal information from data. Traditionally, this has been achieved
through human effort based on prior knowledge and domain expertise about the data and
tasks, which is also called feature engineering. In deploying machine learning and
many other artificial intelligence algorithms, a large portion of the human effort
has historically gone into the design of preprocessing pipelines and data
transformations. More specifically, feature engineering is a way to take advantage of
human ingenuity and prior knowledge in the hope of extracting and organizing the
discriminative information in the data for machine learning tasks. For example,
political scientists may be asked to define a keyword list as the features of
social-media text classifiers for detecting texts about societal events. For speech
transcription, one may choose to extract features from raw sound waves using
operations such as Fourier transforms. Although feature engineering has been widely
adopted over the years, its drawbacks are also salient: 1) Intensive labor from
domain experts is usually needed, because feature engineering may require tight and
extensive collaboration between model developers and domain experts. 2) Feature
extraction is incomplete and biased. Specifically, the capacity and discriminative
power of the extracted features are limited by the knowledge of the domain experts.
Moreover, in many domains where human knowledge is limited, such as early cancer
prediction, which features to extract is itself an open question to domain experts.
To avoid these drawbacks, making learning algorithms less dependent on feature
engineering has been a highly desired goal in machine learning and artificial
intelligence, so that novel applications can be constructed faster and, hopefully,
addressed more effectively.
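
As a minimal sketch of the hand-crafted speech features mentioned above, the following example computes the magnitude spectrum of a sampled waveform with a direct discrete Fourier transform. The function name, the synthetic sine wave, and the choice of normalization are assumptions for illustration; a practical pipeline would use a fast Fourier transform and further processing steps.

```python
import cmath
import math


def dft_magnitudes(signal):
    """Magnitude of each non-negative frequency bin of the discrete Fourier transform.

    This direct O(n^2) implementation is for illustration only.
    """
    n = len(signal)
    mags = []
    for k in range(n // 2 + 1):
        coeff = sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        mags.append(abs(coeff) / n)
    return mags


# Synthetic "sound wave" (hypothetical): 5 full sine cycles over 64 samples
n = 64
wave = [math.sin(2 * math.pi * 5 * t / n) for t in range(n)]

features = dft_magnitudes(wave)   # hand-crafted frequency-domain feature vector
peak = max(range(len(features)), key=features.__getitem__)
print(peak)  # 5
```

The feature vector concentrates its energy in bin 5, the frequency of the input wave. Which transform to apply and which bins to keep are exactly the kind of expert decisions that feature engineering requires.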

Representation learning techniques have developed from traditional methods to more
advanced ones. The traditional methods are “shallow” models that aim to learn
transformations of data that make it easier to extract useful information when
building classifiers or other predictors, such as Principal Component Analysis (PCA)
(Wold et al, 1987), Gaussian Markov random field (GMRF) (Rue and Held, 2005), and
Locality Preserving Projections (LPP) (He and Niyogi, 2004). Deep learning-based
representation learning is formed by the composition of multiple non-linear
transformations, with the goal of yielding more abstract and ultimately more useful
representations. To introduce more recent advances and stay with the major topic of
this book, here we mainly focus on deep learning-based representation learning, which
can be categorized into several types: (1) Supervised learning, where a large amount
of labeled data is needed to train the deep learning models. Given a well-trained
network, the output before the last fully-connected layers is commonly used as the
final representation of the input data; (2) Unsupervised learning (including
self-supervised learning), which facilitates the analysis of input data without
corresponding labels and aims to learn the underlying inherent structure or
distribution of the data. Pretext tasks are used to derive supervision information
from large amounts of unlabeled data. Based on this constructed supervision
information, deep neural networks are trained to extract meaningful representations
for future downstream tasks; (3) Transfer learning, which involves methods that use
any knowledge resource (e.g., data, models, labels) to improve model learning and
generalization for the target task. Transfer learning encompasses different
scenarios including multi-task learning (MTL), model adaptation, knowledge transfer,
covariate shift, etc. There are also other important representation learning methods
such as reinforcement learning, few-shot learning, and disentangled representation
learning.

It is important to define what a good representation is. Following the definition by
Bengio (2008), representation learning is about learning the (underlying) features of
the data that make it easier to extract useful information when building classifiers
or other predictors. Thus, the evaluation of a learned representation is closely
related to its performance on the downstream tasks. For example, in a data generation
task based on a generative model, a good representation is often one that captures
the posterior distribution of the underlying explanatory factors for the observed
input, whereas for a prediction task, a good representation is one that captures the
minimal but sufficient information of the input data to correctly predict the target
label. Besides evaluation from the perspective of the downstream tasks, there are
also some general properties that good representations may hold, such as smoothness,
linearity, capturing multiple explanatory and causal factors, holding shared factors
across different tasks, and simple factor dependencies.


1.2 Representation Learning in Different Areas 


In this section, we summarize the development of representation learning in four
representative areas: (1) image processing; (2) speech recognition; (3) natural
language processing; and (4) network analysis. For the representation learning in
each research area, we consider some of the fundamental questions that have been
driving research in this area. Specifically, what makes one representation better than
another, and how should a representation be computed? Why is representation
learning important in that area? Also, what are appropriate objectives for learning
good representations? We also introduce the typical methods and their development
from the perspective of three main categories: supervised representation learning,
unsupervised learning, and transfer learning.


1.2.1 Representation Learning for Image Processing 


Image representation learning is a fundamental problem in understanding the
semantics of various visual data, such as photographs, medical images, document
scans, and video streams. Normally, the goal of image representation learning for
image processing is to bridge the semantic gap between the pixel data and the
semantics of the images. The achievements of image representation learning have
empowered many real-world applications, including but not limited to image search,
facial recognition, medical image analysis, photo manipulation and target detection.

In recent years, we have witnessed a fast advance of image representation learning
from handcrafted feature engineering to learning from scratch with deep neural
network models. Traditionally, the patterns of images were extracted with the help of
hand-crafted features designed by human beings based on prior knowledge. For example,
Huang et al (2000) extracted the structural features of characters from the strokes
and then used them to recognize handwritten characters. Rui (2005) adopted a
morphology method to improve the local features of the characters and then used PCA
to extract features of the characters. However, all of these methods need to extract
features from images manually, and thus their prediction performance strongly relies
on prior knowledge. In the field of computer vision, manual feature extraction is
very cumbersome and impractical because of the high dimensionality of feature
vectors. Thus, representation learning that can automatically extract meaningful,
hidden and complex patterns from high-dimensional visual data is necessary. Deep
learning-based representations for images are learned in an end-to-end fashion and
can perform much better than hand-crafted features in the target applications, as
long as the training data is of sufficient quality and quantity.

Supervised Representation Learning for image processing. In the domain of image
processing, supervised learning algorithms, such as Convolutional Neural Networks
(CNNs) and Deep Belief Networks (DBNs), are commonly applied to solve various tasks.
One of the earliest deep-supervised-learning-based works was proposed in 2006 (Hinton
et al, 2006); it focused on the MNIST digit image classification problem and
outperformed the state-of-the-art SVMs. Following this, deep convolutional neural
networks (ConvNets) showed remarkable performance, which depends greatly on their
properties of shift invariance, weight sharing and local pattern capturing. Different
types of network architectures were developed to increase the capacity of network
models, and larger and larger datasets have been collected. Various networks,
including AlexNet (Krizhevsky et al, 2012), VGG (Simonyan and Zisserman, 2014b),
GoogLeNet (Szegedy et al, 2015), ResNet (He et al, 2016a), and DenseNet (Huang et al,
2017a), and large-scale datasets, such as ImageNet and OpenImage, have been proposed
to train very deep convolutional neural networks. With sophisticated architectures
and large-scale datasets, convolutional neural networks keep advancing the state of
the art in various computer vision tasks.

Unsupervised Representation Learning for image processing. Collecting and annotating
large-scale datasets is time-consuming and expensive for both image datasets and video
datasets. For example, ImageNet contains about 1.3 million labeled images covering
1,000 classes, and each image is labeled by human workers with one class label. To
alleviate the extensive human annotation labor, many unsupervised methods have been
proposed to learn visual features from large-scale unlabeled images or videos without
using any human annotations. A popular solution is to propose various pretext tasks
for models to solve; the models are trained by optimizing the objective functions of
the pretext tasks, and the features are learned through this process. Various pretext
tasks have been proposed for unsupervised learning, including colorizing gray-scale
images (Zhang et al, 2016d) and image inpainting (Pathak et al, 2016). During the
unsupervised training phase, a predefined pretext task is designed for the models to
solve, and the pseudo labels for the pretext task are automatically generated based
on some attributes of the data. The models are then trained according to the
objective functions of the pretext tasks. When trained with pretext tasks, the
shallower blocks of the deep neural network models focus on low-level general
features such as corners, edges, and textures, while the deeper blocks focus on
high-level task-specific features such as objects, scenes, and object parts.
Therefore, models trained with pretext tasks can learn kernels that capture both
low-level and high-level features that are helpful for other downstream tasks. After
the unsupervised training is finished, the visual features learned in these
pre-trained models can be further transferred to downstream tasks (especially when
only relatively little data is available) to improve performance and overcome
over-fitting.
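
The way pseudo labels are generated automatically for a pretext task can be illustrated with the colorization task mentioned above. The sketch below turns unlabeled RGB images into (gray-scale input, color target) training pairs without any human annotation. The nested-list image format and the function names are assumptions for this example, and the luminance weights follow the common ITU-R BT.601 convention rather than anything specified in the cited works.

```python
def to_gray(pixel):
    """Luminance of an (r, g, b) pixel using ITU-R BT.601 weights."""
    r, g, b = pixel
    return 0.299 * r + 0.587 * g + 0.114 * b


def make_colorization_pair(image):
    """Build one pretext-task example from an unlabeled image.

    The gray-scale version is the model input, and the original color
    image serves as the automatically generated pseudo label.
    """
    gray = [[to_gray(p) for p in row] for row in image]
    return gray, image


# A tiny unlabeled "dataset" (hypothetical): one 2 x 2 RGB image
unlabeled = [
    [
        [(255, 0, 0), (0, 255, 0)],
        [(0, 0, 255), (255, 255, 255)],
    ],
]

pairs = [make_colorization_pair(img) for img in unlabeled]
x, y = pairs[0]
print(x[0])   # gray-scale values of the first row (model input)
print(y[0])   # original colors of the first row (pseudo label)
```

A model trained to predict `y` from `x` has to learn what objects and textures look like, which is why the features it learns can later be transferred to downstream tasks.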

Transfer Learning for image processing. In real-world applications, due to the high
cost of manual labeling, sufficient training data that belongs to the same feature
space or distribution as the testing data may not always be accessible. Transfer
learning mimics the human vision system by making use of sufficient amounts of prior
knowledge in other related domains (i.e., source domains) when executing new tasks in
the given domain (i.e., target domain). In transfer learning, both the training set
and the test set can contribute to the target and source domains. In most cases,
there is only one target domain for a transfer learning task, while either single or
multiple source domains can exist. The techniques of transfer learning in image
processing can be categorized into feature representation knowledge transfer and
classifier-based knowledge transfer. Specifically, feature representation transfer
methods map the target domain to the source domains by exploiting a set of extracted
features, so that the data divergence between the target domain and the source
domains is significantly reduced and the performance of the task in the target domain
is improved. Classifier-based knowledge-transfer methods, by contrast, usually share
the common trait that the learned source domain models are used as prior knowledge to
learn the target model together with the training samples. Instead of minimizing the
cross-domain dissimilarity by updating instances’ representations, classifier-based
knowledge-transfer methods aim to learn a new model that minimizes the generalization
error in the target domain via the provided training set from both domains and the
learned model.

Other Representation Learning for Image Processing. Other types of representation
learning are also commonly used in image processing, such as reinforcement learning
and semi-supervised learning. For example, reinforcement learning is commonly explored
in the tasks of image captioning (Liu et al, 2018a; Ren et al, 2017) and image editing
(Kosugi and Yamasaki, 2020), where the learning process is formalized as a sequence of
actions based on a policy network.


1.2.2 Representation Learning for Speech Recognition


Nowadays, speech interfaces and systems are widely developed and integrated
into various real-life applications and devices. Services like Siri, Cortana,
and Google Voice Search have become a part of our daily life and are used by millions
of users. The exploration in speech recognition and analysis has always been
motivated by a desire to enable machines to participate in verbal human-machine 
interactions. The research goals of enabling machines to understand human speech, 
identify speakers, and detect human emotion have attracted researchers’ attention 
for more than sixty years across several distinct research areas, including but not 
limited to Automatic Speech Recognition (ASR), Speaker Recognition (SR), and 
Speaker Emotion Recognition (SER). 

Analyzing and processing speech has been a key application of machine learning (ML)
algorithms. Research on speech recognition has traditionally treated the task of
designing hand-crafted acoustic features as a problem separate from the task of
designing efficient models to make prediction and classification decisions. There are
two main drawbacks to this approach: first, the feature engineering is cumbersome and
requires human knowledge, as introduced above; and second, the designed features
might not be the best for the specific speech recognition task at hand. This has
motivated the speech community’s recent move towards representation learning
techniques, which can automatically learn an intermediate representation of the input
signal that better fits the task at hand and hence leads to improved performance.
Among these techniques, deep learning-based speech representations play an important
role. One of the major reasons for using representation learning techniques in speech
technology is that speech data is fundamentally different from two-dimensional image
data. Images can be analyzed as a whole or in patches, but speech has to be formatted
sequentially to capture temporal dependencies and patterns.

Supervised representation learning for speech recognition. In the domain of
speech recognition and analysis, supervised representation learning methods are
widely employed, where feature representations are learned on datasets by leveraging
label information. For example, restricted Boltzmann machines (RBMs) (Jaitly
and Hinton, 2011; Dahl et al, 2010) and deep belief networks (DBNs) (Cairong
et al, 2016; Ali et al, 2018) are commonly utilized in learning features from speech
for different tasks, including ASR, speaker recognition, and SER. As another example,
in 2012, Microsoft released a new version of its MAVIS (Microsoft Audio
Video Indexing Service) speech system based on context-dependent deep neural
networks (Seide et al, 2011). These authors managed to reduce the word error rate on
four major benchmarks by about 30% (e.g., from 27.4% to 18.5% on RT03S)
compared to the traditional models based on Gaussian mixtures. Convolutional neural
networks are another popular class of supervised models widely utilized for feature
learning from speech signals in tasks such as speech and speaker recognition (Palaz
et al, 2015a,b) and SER (Latif et al, 2019; Tzirakis et al, 2018). Moreover, it has
been found that LSTMs (or GRUs) can help CNNs learn more useful features
from speech by capturing both local and long-term dependencies (Dahl et al, 2010).


1 Siri is an artificial intelligence assistant that is built into Apple’s iOS system.

2 Microsoft Cortana is an intelligent personal assistant developed by Microsoft, known as “the
world’s first cross-platform intelligent personal assistant”.

3 Google Voice Search is a product of Google that allows you to use Google to search by speaking
to a mobile phone or computer; that is, the spoken content captured on the device is recognized
by the server, and information is then searched for based on the recognition results.

Unsupervised Representation Learning for speech recognition. Unsupervised 
representation learning from large unlabelled datasets is an active area of speech 
recognition. In the context of speech analysis, it can exploit the practically
unlimited amount of available unlabelled corpora to learn good intermediate feature
representations, which can then be used to improve the performance of a variety of
downstream supervised speech recognition tasks or speech signal synthesis
tasks. In the tasks of ASR and SR, most works are based on variational
auto-encoders (VAEs), where a generative model and an inference model are jointly
learned, which allows them to capture latent representations from observed speech
data (Chorowski et al, 2019; Hsu et al, 2019, 2017). For example, Hsu et al (2017)
proposed a hierarchical VAE to capture interpretable and disentangled representations
from speech without any supervision. Other auto-encoding architectures such as
denoising autoencoders (DAEs) have also been found very promising for learning speech
representations in an unsupervised way, especially for noisy speech recognition (Feng
et al, 2014; Zhao et al, 2015). Beyond the aforementioned, adversarial
learning (AL), for example with generative adversarial nets (GANs), has recently
emerged as a powerful tool for learning unsupervised representations of speech.
It involves at least a generator and a discriminator, where the former tries to
generate data that is as realistic as possible to fool the latter, which in turn
tries its best to tell generated data from real data. Hence
both the generator and the discriminator can be trained and improved iteratively in
an adversarial way, which results in more discriminative and robust features. Among
these, GANs (Chang and Scherer, 2017; Donahue et al, 2018) and adversarial autoencoders
(AAEs) (Sahu et al, 2017) are becoming increasingly popular in modeling speech,
not only in ASR but also in SR and SER.
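
For reference, the two training objectives mentioned above can be written in their
standard textbook forms. The notation is introduced here for illustration only and is
not taken from the cited works: x is an observed speech segment, z a latent variable,
θ and φ the parameters of the generative and inference models, and G and D the
generator and the discriminator.

```latex
% VAE: the generative model p_\theta(x|z) and the inference model q_\phi(z|x)
% are learned jointly by maximizing the evidence lower bound (ELBO)
\log p_\theta(x) \;\ge\;
  \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big]
  - \mathrm{KL}\big(q_\phi(z \mid x)\,\|\,p(z)\big)

% GAN: the generator G and the discriminator D play a minimax game
\min_G \max_D \;
  \mathbb{E}_{x \sim p_{\mathrm{data}}}\big[\log D(x)\big]
  + \mathbb{E}_{z \sim p(z)}\big[\log\big(1 - D(G(z))\big)\big]
```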

Transfer Learning for speech recognition. Transfer learning (TL) encompasses
different approaches, including MTL, model adaptation, knowledge transfer, covariate
shift, etc. In the domain of speech recognition, representation learning has gained
much interest in these TL approaches, including but not limited to domain adaptation,
multi-task learning, and self-taught learning. In terms of domain adaptation,
speech is a typical example of heterogeneous data, and thus a mismatch always exists
between the probability distributions of source and target domain data. To build
more robust systems for real-life speech-related applications, domain adaptation
techniques are usually applied in the training pipeline of deep neural networks to
learn representations that explicitly minimize the difference between
the distributions of data in the source and target domains (Sun et al, 2017; Swietojanski
et al, 2016). In terms of MTL, the learned representations can successfully increase
the performance of speech recognition without requiring contextual speech data,
since speech contains multi-dimensional information (message, speaker, gender, or
emotion) that can be used as auxiliary tasks. For example, in the task of ASR, by using
MTL with different auxiliary tasks including gender, speaker adaptation, and speech
enhancement, it has been shown that the learned shared representations for different
tasks can act as complementary information about the acoustic environment and
give a lower word error rate (WER) (Parthasarathy and Busso, 2017; Xia and Liu,
2015).

Other Representation Learning for speech recognition. Other than the above- 
mentioned three categories of representation learning for speech signals, there are 
also some other representation learning techniques commonly explored, such as 
semi-supervised learning and reinforcement learning (RL). For example, in
ASR, semi-supervised learning is mainly used to circumvent the lack
of sufficient training data. This can be achieved by creating feature front
ends (Thomas et al, 2013), by using multilingual acoustic representations (Cui
et al, 2015), or by extracting an intermediate representation from large unpaired
datasets (Karita et al, 2018). RL is also gaining interest in the area of speech
recognition, and there have been multiple approaches to model different speech problems,
including dialog modeling and optimization (Levin et al, 2000), speech recogni- 
tion (Shen et al, 2019), and emotion recognition (Sangeetha and Jayasankar, 2019). 


1.2.3 Representation Learning for Natural Language Processing 


Besides speech recognition, representation learning has many other applications in
natural language processing (NLP), such as text representation learning.
For example, Google’s image search exploits huge quantities of data to map images
and queries into the same space (Weston et al, 2010) based on NLP techniques.
In general, there are two types of applications of representation learning in NLP.
In the first type, the semantic representation, such as a word embedding, is trained
in a pre-training task (or directly designed by human experts) and is transferred to
the model for the target task. It is trained with a language modeling objective
and is taken as input by other downstream NLP models. In the second type, the
semantic representation lies within the hidden states of the deep learning model and
directly aims for better performance on the target task in an end-to-end fashion. For
example, many NLP tasks, such as sentiment classification, natural language inference,
and relation extraction, need to semantically compose sentence or document
representations.

Conventional NLP tasks rely heavily on feature engineering, which requires careful
design and considerable expertise. Recently, representation learning, especially
deep learning-based representation learning, has been emerging as the most important
technique for NLP. First, NLP is typically concerned with multiple levels of language
entries, including but not limited to characters, words, phrases, sentences, paragraphs,
and documents. Representation learning is able to represent the semantics of these
multi-level language entries in a unified semantic space and to model complex semantic
dependence among them. Second, various NLP
tasks can be conducted on the same input. For example, given a sentence, we
can perform multiple tasks such as word segmentation, named entity recognition,
relation extraction, co-reference linking, and machine translation. In this case, it
is more efficient and robust to build a unified representation space of inputs
for multiple tasks. Last, natural language texts may be collected from multiple domains,
including but not limited to news articles, scientific articles, literary works,
advertisements, and online user-generated content such as product reviews and social
media. Moreover, texts can also be collected from different languages, such as
English, Chinese, Spanish, and Japanese. Compared to conventional NLP systems,
which have to design specific feature extraction algorithms for each domain according
to its characteristics, representation learning enables us to build representations
automatically from large-scale domain data and even to build bridges among different
languages and domains. Given these advantages of representation learning
for NLP in reducing feature engineering and improving performance, many
researchers have developed efficient representation learning algorithms, especially
deep learning-based approaches, for NLP.

Supervised Representation Learning for NLP. Deep neural networks in the supervised
learning setting for NLP evolved from distributed representation learning
to CNN models, and finally to RNN models in recent years. At an early stage,
distributed representations were first developed in the context of statistical language
modeling by Bengio (2008) in so-called neural net language models. The model
learns a distributed representation for each word (i.e., a word embedding).
Following this, the need arose for an effective feature function that extracts higher-level
features from constituent words or n-grams. CNNs turned out to be the natural
choice given their excellent performance in computer vision and
speech processing tasks. CNNs can extract salient n-gram features
from the input sentence to create an informative latent semantic representation of
the sentence for downstream tasks. This domain was pioneered by Collobert et al
(2011) and Kalchbrenner et al (2014), which led to a huge proliferation of CNN-based
networks in the succeeding literature. The neural net language model was also
improved by adding recurrence to the hidden layers (Mikolov et al, 2011a) (i.e.,
an RNN), allowing it to beat the state of the art (smoothed n-gram models) not only in
terms of perplexity (the exponential of the average negative log-likelihood of predicting
the right next word) but also in terms of WER in speech recognition. RNNs are built on
the idea of processing sequential information. The term “recurrent” applies because they
perform the same computation over each token of the sequence, and each step
depends on the previous computations and results. Generally, a fixed-size vector is
produced to represent a sequence by feeding tokens one by one to a recurrent unit. In
a way, RNNs have “memory” over previous computations and use this information
in current processing. This template is naturally suited to many NLP tasks such
as language modeling (Mikolov et al, 2010, 2011b), machine translation (Liu et al,
2014; Sutskever et al, 2014), and image captioning (Karpathy and Fei-Fei, 2015).
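
As a rough illustration of the two ideas above, the following sketch computes
perplexity as defined in the text and runs a minimal recurrent unit that folds a
token sequence into a fixed-size vector. The function names, the tanh nonlinearity,
and the tiny hand-set weights are assumptions made for this example; they are not
taken from the cited works.

```python
import math


def perplexity(next_word_probs):
    """Exponential of the average negative log-likelihood.

    next_word_probs[i] is the probability the model assigned to the word
    that actually occurred at position i.
    """
    nll = [-math.log(p) for p in next_word_probs]
    return math.exp(sum(nll) / len(nll))


def rnn_encode(token_vectors, w_in, w_rec, bias):
    """Feed tokens one by one to a recurrent unit; return the last hidden state."""
    hidden = [0.0] * len(bias)
    for x in token_vectors:
        # The same computation is applied at every step; the previous hidden
        # state carries the "memory" of earlier tokens.
        hidden = [
            math.tanh(
                sum(w * xi for w, xi in zip(w_in[j], x))
                + sum(w * hi for w, hi in zip(w_rec[j], hidden))
                + bias[j]
            )
            for j in range(len(bias))
        ]
    return hidden


# Hypothetical toy weights: 2-dimensional tokens, 2-dimensional hidden state.
w_in = [[1.0, 0.0], [0.0, 1.0]]
w_rec = [[0.5, 0.0], [0.0, 0.5]]
bias = [0.0, 0.0]

state_ab = rnn_encode([[1.0, 0.0], [0.0, 1.0]], w_in, w_rec, bias)
state_ba = rnn_encode([[0.0, 1.0], [1.0, 0.0]], w_in, w_rec, bias)
uniform_ppl = perplexity([0.25, 0.25, 0.25, 0.25])
```

A model that spreads its probability uniformly over four candidate words has a
perplexity of four, and the two token orders yield different final states, which is
the order sensitivity that the "memory" of an RNN provides.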

Unsupervised Representation Learning for NLP. Unsupervised learning (including
self-supervised learning) has achieved great success in NLP, because plain text itself
contains abundant knowledge and patterns about languages. For example, in most
deep learning-based NLP models, words in sentences are first mapped to their corresponding
embeddings via techniques such as word2vec (Mikolov et al, 2013b),
GloVe (Pennington et al, 2014), and BERT (Devlin et al, 2019) before being fed to
the networks. However, there are no human-annotated “labels” for learning those
word embeddings. To acquire the training objective necessary for neural networks,
it is necessary to generate “labels” intrinsically from the existing data. Language
modeling is a typical unsupervised learning task, which constructs a probability
distribution over sequences of words and does not require human annotations.
Based on the distributional hypothesis, using the language modeling objective can
lead to hidden representations that encode the semantics of words. Another typical
unsupervised learning model in NLP is the auto-encoder (AE), which consists of
a reduction (encoding) phase and a reconstruction (decoding) phase. For example,
recursive auto-encoders (which generalize recurrent networks with VAE) were
used to beat the state of the art at the time of their publication in full sentence
paraphrase detection (Socher et al, 2011) by almost doubling the F1 score for paraphrase
detection.
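
To make the idea of generating “labels” intrinsically from plain text concrete, here is
a minimal sketch. It builds next-word examples for a language modeling objective and
(center, context) pairs in the style of skip-gram training. The function names, the
window sizes, and the sample sentence are illustrative assumptions, not part of any
cited system.

```python
def next_word_examples(tokens, context_size=2):
    """Return (context, next_word) examples; the next word acts as the label."""
    return [
        (tuple(tokens[i - context_size:i]), tokens[i])
        for i in range(context_size, len(tokens))
    ]


def skipgram_pairs(tokens, window=1):
    """Return (center, context) pairs for words within a symmetric window."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


# Raw text is the only input; no human annotation is involved.
tokens = "graphs encode relations between entities".split()
lm_examples = next_word_examples(tokens)
sg_pairs = skipgram_pairs(tokens, window=1)
```

Every training target here is a word that already occurs in the text, which is why
such objectives scale to very large unlabelled corpora.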

Transfer Learning for NLP. In recent years, the field of NLP has witnessed
fast growth of transfer learning methods via sequential transfer learning
models and architectures, which have significantly improved upon the state of the art
on a wide range of NLP tasks. In terms of domain adaptation, sequential transfer
learning consists of two stages: a pretraining phase, in which general representations
are learned on a source task or domain, followed by an adaptation phase, during
which the learned knowledge is applied to a target task or domain. Domain adaptation
in NLP is categorized into model-centric, data-centric, and hybrid approaches.
Model-centric methods focus on augmenting the feature space, as
well as altering the loss function, the architecture, or the model parameters (Blitzer
et al, 2006). Data-centric methods focus on the data aspect and involve pseudo-labeling
(or bootstrapping), where only a small number of classes are shared between
the source and target datasets (Abney, 2007). Lastly, hybrid methods combine
data-centric and model-centric approaches. Similarly, great advances have also been
made in multi-task learning in NLP, where learning different NLP tasks jointly can result in
better representations of texts. For example, based on a convolutional architecture,
Collobert et al (2011) developed the SENNA system, which shares representations
across the tasks of language modeling, part-of-speech tagging, chunking, named entity
recognition, semantic role labeling, and syntactic parsing. SENNA approaches
or sometimes even surpasses the state of the art on these tasks while being simpler and
much faster than traditional predictors. Moreover, learning word embeddings can be
combined with learning image representations in a way that allows associating texts
and images.

Other Representation Learning for NLP. In NLP tasks, when a problem gets
more complicated, it requires more knowledge from domain experts to annotate
training instances for fine-grained tasks, which increases the cost of data labeling.
Therefore, models or systems sometimes need to be developed efficiently
with (very) few labeled data. When each class has only one or a few labeled instances,
the problem becomes a one/few-shot learning problem. The few-shot learning
problem originated in computer vision and has also been studied in NLP
recently. For example, researchers have explored few-shot relation extraction (Han
et al, 2018), where each relation has a few labeled instances, and low-resource machine
translation (Zoph et al, 2016), where the size of the parallel corpus is limited.


1.2.4 Representation Learning for Networks 


Beyond popular data like images, texts, and sounds, network data is another important
data type that is becoming ubiquitous across a wide range of real-world applications,
ranging from cyber-networks (e.g., social networks, citation networks,
telecommunication networks, etc.) to physical networks (e.g., transportation networks,
biological networks, etc.). Network data can be formulated mathematically as graphs,
where vertices and their relationships jointly characterize the network
information. Networks and graphs are such powerful and flexible data formulations
that sometimes we could even consider other data types, like images and texts,
as special cases of them. For example, images can be considered as grids of nodes with
RGB attributes, which are special types of graphs, while texts can also be organized
into sequential-, tree-, or graph-structured information. So in general, representation
learning for networks is widely considered a promising yet more challenging
task that requires the advancement and generalization of many techniques
developed for images, texts, and so forth. In addition to the intrinsic high complexity of
network data, the efficiency of representation learning on networks is also an important
issue considering the large scale of many real-world networks, ranging from
hundreds to millions or even billions of vertices. Analyzing information networks
plays a crucial role in a variety of emerging applications across many disciplines.
For example, in social networks, classifying users into meaningful social groups is
useful for many important tasks, such as user search, targeted advertising, and recommendations;
in communication networks, detecting community structures can help
better understand the rumor spreading process; in biological networks, inferring interactions
between proteins can facilitate new treatments for diseases. Nevertheless,
efficient and effective analysis of these networks relies heavily on good representations
of the networks.
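
The remark that an image is a special type of graph can be sketched as follows: every
pixel becomes a node carrying its RGB value as an attribute, and edges connect
horizontally and vertically adjacent pixels. The function name, the 4-neighbor
connectivity, and the 2 x 2 sample image are assumptions chosen for illustration.

```python
def image_to_grid_graph(image):
    """Turn an image (list of rows of (r, g, b) tuples) into a grid graph.

    Node ids are row * width + col. Each undirected edge is stored once,
    linking a pixel to its right and lower neighbours.
    """
    height, width = len(image), len(image[0])
    node_attrs = {}
    edges = []
    for r in range(height):
        for c in range(width):
            node = r * width + c
            node_attrs[node] = image[r][c]
            if c + 1 < width:
                edges.append((node, node + 1))
            if r + 1 < height:
                edges.append((node, node + width))
    return node_attrs, edges


tiny_image = [
    [(255, 0, 0), (0, 255, 0)],
    [(0, 0, 255), (255, 255, 255)],
]
attrs, grid_edges = image_to_grid_graph(tiny_image)
```

A general network differs from this grid in that node degrees and neighborhoods are
irregular, which is one reason techniques designed for images need to be generalized.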

Traditional feature engineering on network data usually focuses on obtaining a
number of predefined, straightforward features at the graph level (e.g., the diameter,
average path length, and clustering coefficient), the node level (e.g., node degree and
centrality), or the subgraph level (e.g., frequent subgraphs and graph motifs). This
limited number of hand-crafted, well-defined features, though describing several fundamental
aspects of the graphs, discards the patterns that they cannot cover.
Moreover, real-world network phenomena are usually highly complicated: they may require
sophisticated, unknown combinations of those predefined features or may not be
characterized by any of the existing features. In addition, traditional graph feature
engineering usually involves expensive computations with super-linear or exponential
complexity, which often makes network analytic tasks computationally
expensive and intractable over large-scale networks. For example, in dealing with
the task of community detection, classical methods involve calculating the spectral
decomposition of a matrix with at least quadratic time complexity with respect to
the number of vertices. This computational overhead makes such algorithms hard to scale
to large-scale networks with millions of vertices.
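
A few of the hand-crafted features named above can be computed with the short sketch
below. It assumes a small connected, undirected, unweighted graph stored as an
adjacency dictionary; the function names and the sample graph are illustrative only.
Note that the graph-level features already require a breadth-first search from every
node.

```python
from collections import deque


def degree(adj, v):
    """Node-level feature: number of neighbours of v."""
    return len(adj[v])


def clustering_coefficient(adj, v):
    """Fraction of pairs of v's neighbours that are themselves connected."""
    nbrs = list(adj[v])
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(
        1
        for i in range(k)
        for j in range(i + 1, k)
        if nbrs[j] in adj[nbrs[i]]
    )
    return 2.0 * links / (k * (k - 1))


def bfs_distances(adj, source):
    """Hop distances from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def diameter_and_average_path_length(adj):
    """Graph-level features; assumes the graph is connected."""
    longest, total, pairs = 0, 0, 0
    for s in adj:
        for t, d in bfs_distances(adj, s).items():
            if t != s:
                longest = max(longest, d)
                total += d
                pairs += 1
    return longest, total / pairs


# Triangle 0-1-2 with a pendant node 3 attached to node 2.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
diam, avg_len = diameter_and_average_path_length(adj)
```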

More recently, network representation learning (NRL) has attracted a lot of research
interest. NRL aims to learn latent, low-dimensional representations of network
vertices while preserving network topology structure, vertex content, and
other side information. After new vertex representations are learned, network analytic
tasks can be easily and efficiently carried out by applying conventional vector-based
machine learning algorithms to the new representation space. Earlier work
related to network representation learning dates back to the early 2000s, when researchers
proposed graph embedding algorithms as part of dimensionality reduction
techniques. Given a set of independent and identically distributed (i.i.d.) data points
as input, graph embedding algorithms first calculate the similarity between pairwise
data points to construct an affinity graph, e.g., the k-nearest neighbor graph, and
then embed the affinity graph into a new space having much lower dimensionality.
However, graph embedding algorithms are designed for i.i.d. data, mainly for dimensionality
reduction purposes, and usually have at least quadratic time complexity
with respect to the number of vertices.
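
The first step described above, building a k-nearest neighbor affinity graph from
i.i.d. data points, can be sketched as follows. The Euclidean distance, the
symmetrization rule, and the sample points are assumptions made for this example.
The double loop over all pairs of points is where the quadratic cost comes from.

```python
import math


def knn_graph(points, k):
    """Connect every point to its k nearest neighbours (Euclidean distance).

    Returns a dict: node index -> {neighbour index: distance}. The graph is
    symmetrised, so an edge exists if either endpoint selects the other.
    """
    n = len(points)
    graph = {i: {} for i in range(n)}
    for i in range(n):
        # All pairwise distances are evaluated: O(n^2) distance computations.
        dists = sorted(
            (math.dist(points[i], points[j]), j) for j in range(n) if j != i
        )
        for d, j in dists[:k]:
            graph[i][j] = d
            graph[j][i] = d
    return graph


points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 6.0)]
affinity = knn_graph(points, k=1)
```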

Since 2008, significant research efforts have shifted to the development of ef- 
fective and scalable representation learning techniques that are directly designed 
for complex information networks. Many network representation learning algo- 
rithms (Perozzi et al, 2014; Yang et al, 2015b; Zhang et al, 2016b; Manessi et al, 
2020) have been proposed to embed existing networks, showing promising per- 
formance for various applications. These methods embed a network into a latent, 
low-dimensional space that preserves structure proximity and attribute affinity. The 
resulting compact, low-dimensional vector representations can then be used as features
for any vector-based machine learning algorithm. This paves the way for a
wide range of network analytic tasks to be easily and efficiently tackled in the new
vector space, such as node classification (Zhu et al, 2007), link prediction (Lü and
Zhou, 2011), clustering (Malliaros and Vazirgiannis, 2013), and network synthesis (You
et al, 2018b). The following chapters of this book provide a systematic and
comprehensive introduction to network representation learning.
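
As a small illustration of how learned vectors feed a conventional vector-based
algorithm, the sketch below classifies a node with a nearest-centroid rule. The
embedding values are made up by hand for this example (no embedding method is run),
and the function names and labels are hypothetical.

```python
import math


def fit_centroids(embeddings, labels):
    """Average the embedding vectors of the labelled nodes of each class."""
    sums, counts = {}, {}
    for node, label in labels.items():
        vec = embeddings[node]
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, x in enumerate(vec):
            acc[i] += x
        counts[label] = counts.get(label, 0) + 1
    return {
        label: [x / counts[label] for x in acc] for label, acc in sums.items()
    }


def predict(centroids, vec):
    """Assign the class whose centroid is closest in the embedding space."""
    return min(centroids, key=lambda label: math.dist(centroids[label], vec))


# Hypothetical 2-dimensional node embeddings; node "e" is unlabelled.
embeddings = {
    "a": [0.9, 0.1], "b": [1.0, 0.0],
    "c": [0.1, 0.8], "d": [0.0, 1.0],
    "e": [0.8, 0.2],
}
labels = {"a": "group1", "b": "group1", "c": "group2", "d": "group2"}

centroids = fit_centroids(embeddings, labels)
predicted = predict(centroids, embeddings["e"])
```

Once nodes are vectors, the classifier never touches the edge set, which is what makes
off-the-shelf learning algorithms applicable.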


1.3 Summary 


Representation learning is currently a very active and important field, which heavily
influences the effectiveness of machine learning techniques. Representation learning
is about learning representations of the data that make it easier to extract
useful and discriminative information when building classifiers or other predictors.
Among the various ways of learning representations, deep learning algorithms have
increasingly been employed in many areas, where good representations
can be learned in an efficient and automatic way from large amounts of complex
and high-dimensional data. The evaluation of a representation is closely related to its
performance on downstream tasks. There are also some general properties
that good representations may hold, such as smoothness, linearity,
and disentanglement, as well as capturing multiple explanatory and causal factors.

We have summarized the representation learning techniques in different domains,
focusing on the unique challenges and models for different areas, including the
processing of images, natural language, and speech signals. For each area, many
deep learning-based representation techniques have emerged from different categories,
including supervised learning, unsupervised learning, transfer learning, disentangled
representation learning, reinforcement learning, etc. We have also briefly
mentioned representation learning on networks and its relation to that on
images, texts, and speech, which will be elaborated in the following chapters.


Chapter 2
Graph Representation Learning


Peng Cui, Lingfei Wu, Jian Pei, Liang Zhao and Xiao Wang 


Abstract Graph representation learning aims at assigning nodes in a graph to low-dimensional
representations that effectively preserve the graph structures. Recently,
a significant amount of progress has been made toward this emerging graph
analysis paradigm. In this chapter, we first summarize the motivation of graph representation
learning. We then provide a comprehensive overview
of a large number of graph representation learning methods in a systematic manner,
covering traditional graph representation learning, modern graph representation
learning, and graph neural networks.


2.1 Graph Representation Learning: An Introduction 


Many complex systems take the form of graphs, such as social networks, biological
networks, and information networks. It is well recognized that graph data is often
sophisticated and thus challenging to deal with. To process graph data effectively,
the first critical challenge is to find an effective graph data representation, that is, how
to represent graphs concisely so that advanced analytic tasks, such as pattern discovery,
analysis, and prediction, can be conducted efficiently in both time and space.
Traditionally, we usually represent a graph as G = (V, E), where V is a node set
and E is an edge set. For large graphs, such as those with billions of nodes, the
traditional graph representation poses several challenges to graph processing and
analysis.

(1) High computational complexity. The relationships encoded by the edge
set E force most graph processing or analysis algorithms to take either iterative or
combinatorial computation steps. For example, a popular way is to use the shortest
or average path length between two nodes to represent their distance. To compute
such a distance using the traditional graph representation, we have to enumerate
many possible paths between two nodes, which is by nature a combinatorial problem.
Such methods result in high computational complexity that prevents them from
being applicable to large-scale real-world graphs.

(2) Low parallelizability. Parallel and distributed computing is the de facto approach
to processing and analyzing large-scale data. Graph data represented in the traditional way,
however, poses severe difficulties for the design and implementation of parallel and
distributed algorithms. The bottleneck is that nodes in a graph are coupled to each
other, as explicitly reflected by E. Thus, distributing different nodes to different shards
or servers often causes very high communication cost among servers and
holds back the speed-up ratio.

(3) Inapplicability of machine learning methods. Recently, machine learning
methods, especially deep learning, have proved very powerful in many areas. For graph data
represented in the traditional way, however, most off-the-shelf machine learning
methods may not be applicable. Those methods usually assume that data samples
can be represented by independent vectors in a vector space, while the samples
in graph data (i.e., the nodes) are dependent on each other to some degree determined
by E. Although we can simply represent a node by its corresponding row vector in
the adjacency matrix of the graph, the extremely high dimensionality of such a representation
in a large graph with many nodes makes the subsequent graph processing
and analysis difficult.
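
Challenges (2) and (3) can be made concrete with a small sketch. The first function
counts the edges whose endpoints land on different shards, a rough proxy for
cross-server communication; the second represents each node by its row of the
adjacency matrix, whose length equals the number of nodes. The modulo-based shard
assignment, the function names, and the five-edge sample graph are assumptions made
for illustration.

```python
def cut_edges(edges, shard_of):
    """Count edges whose two endpoints are assigned to different shards."""
    return sum(1 for u, v in edges if shard_of(u) != shard_of(v))


def adjacency_rows(num_nodes, edges):
    """Represent each node by its row in the (undirected) adjacency matrix."""
    rows = [[0] * num_nodes for _ in range(num_nodes)]
    for u, v in edges:
        rows[u][v] = 1
        rows[v][u] = 1
    return rows


edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]

# Two shards, nodes assigned by parity: most edges cross the shard boundary.
num_cut = cut_edges(edges, shard_of=lambda node: node % 2)

# Each node vector has one dimension per node in the graph.
rows = adjacency_rows(4, edges)
```

With a billion nodes, each such row vector would have a billion dimensions, which is
the dimensionality problem that dense low-dimensional representations avoid.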

To tackle these challenges, substantial effort has been devoted to developing
novel graph representation learning methods, i.e., learning dense and continuous low-dimensional
vector representations for nodes, so that noise and redundant information
can be reduced and the intrinsic structure information can be preserved. In
the learned representation space, the relationships among the nodes, which were 
originally represented by edges or other high-order topological measures in graphs, 
are captured by the distances between nodes in the vector space, and the structural 
characteristics of a node are encoded into its representation vector. 

Basically, for the representation space to support graph analysis
tasks well, there are two goals for graph representation learning. First, the original
graph can be reconstructed from the learned representation space. It requires that, if 
there is an edge or relationship between two nodes, then the distance of these two 
nodes in the representation space should be relatively small. Second, the learned rep- 
resentation space can effectively support graph inference, such as predicting unseen 
links, identifying important nodes, and inferring node labels. It should be noted that 
a representation space with only the goal of graph reconstruction is not sufficient
for graph inference. After the representation is obtained, downstream tasks such as
node classification, node clustering, graph visualization, and link prediction can be
dealt with based on these representations. Overall, there are three main categories of 
graph representation learning methods: traditional graph embedding, modern graph 
embedding, and graph neural networks, which will be introduced separately in the 
following three sections. 
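
The first goal, graph reconstruction, can be checked with a sketch like the one below:
the closest node pairs in the representation space are predicted as edges and compared
with the true edge set. The embedding coordinates are hand-picked for illustration
rather than learned, and the function name and the choice of predicting exactly as many
edges as the graph has are assumptions of this example.

```python
import math
from itertools import combinations


def reconstruct_edges(embeddings, num_edges):
    """Predict the num_edges closest node pairs as edges."""
    pairs = sorted(
        combinations(sorted(embeddings), 2),
        key=lambda p: math.dist(embeddings[p[0]], embeddings[p[1]]),
    )
    return set(pairs[:num_edges])


# True graph: a path 0-1-2 and a separate edge 3-4.
true_edges = {(0, 1), (1, 2), (3, 4)}

# Hypothetical 2-dimensional embeddings in which linked nodes lie close together.
embeddings = {
    0: (0.0, 0.0), 1: (0.1, 0.0), 2: (0.2, 0.1),
    3: (2.0, 2.0), 4: (2.1, 2.0),
}

predicted_edges = reconstruct_edges(embeddings, len(true_edges))
precision = len(predicted_edges & true_edges) / len(true_edges)
```

Perfect reconstruction on observed edges, as in this toy case, says nothing yet about
unseen links or node labels, which is why the second goal is stated separately.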


2.2 Traditional Graph Embedding 


Traditional graph embedding methods were originally studied as dimension reduction
techniques. A graph is usually constructed from a feature-represented data set, such as an
image data set. As mentioned before, graph embedding usually has two goals, i.e.,
reconstructing original graph structures and supporting graph inference. The objective
functions of traditional graph embedding methods mainly target the goal of graph
reconstruction.

Specifically, Isomap (Tenenbaum et al, 2000) first constructs a neighborhood graph G
using connectivity algorithms such as K nearest neighbors (KNN). Then, based on G,
the shortest paths between different data entries can be computed. Consequently, for
all the N data entries in the data set, we have the matrix of graph distances. Finally,
the classical multidimensional scaling (MDS) method is applied to the matrix to obtain
the coordinate vectors. The representations learned by Isomap approximately preserve
the geodesic distances of the entry pairs in the low-dimensional space. The key
problem of Isomap is its high complexity due to the computation of pairwise shortest
paths. Locally linear embedding (LLE) (Roweis and Saul, 2000) was proposed
to eliminate the need to estimate the pairwise distances between widely separated
entries. LLE assumes that each entry and its neighbors lie on or close to a locally
linear patch of a manifold. To characterize the local geometry, each entry is
reconstructed from its neighbors. Finally, in the low-dimensional space, LLE constructs
a neighborhood-preserving mapping based on locally linear reconstruction.
Laplacian eigenmaps (LE) (Belkin and Niyogi, 2002) also begins by constructing
a graph using ε-neighborhoods or K nearest neighbors. Then the heat kernel
(Berline et al, 2003) is utilized to choose the weight between two nodes in the graph.
Finally, the node representations are obtained based on Laplacian matrix
regularization. Furthermore, the locality preserving projection (LPP) (Berline et al,
2003), a linear approximation of the nonlinear LE, has been proposed.
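
The first two steps of Isomap described above (neighborhood graph, then pairwise
shortest paths) can be sketched in a few lines. This is only an illustrative sketch,
not the authors' implementation: the function names knn_graph and geodesic_distances
and the toy half-circle data are assumptions made for this example, and the final MDS
step is omitted. The triple loop also makes the cubic cost of the shortest-path step
visible.

```python
import math

def knn_graph(points, k):
    """Symmetric K-nearest-neighbor graph as a dense matrix of edge lengths."""
    n = len(points)
    dist = [[math.inf] * n for _ in range(n)]
    for i in range(n):
        dist[i][i] = 0.0
        # Rank all other points by Euclidean distance to point i.
        ranked = sorted((math.dist(points[i], points[j]), j) for j in range(n) if j != i)
        for d, j in ranked[:k]:
            dist[i][j] = dist[j][i] = d
    return dist

def geodesic_distances(graph):
    """All-pairs shortest paths (Floyd-Warshall), cubic in the number of entries."""
    n = len(graph)
    d = [row[:] for row in graph]
    for mid in range(n):
        for i in range(n):
            for j in range(n):
                if d[i][mid] + d[mid][j] < d[i][j]:
                    d[i][j] = d[i][mid] + d[mid][j]
    return d

# Toy data (hypothetical): nine points on a half circle of radius 1.
points = [(math.cos(i * math.pi / 8), math.sin(i * math.pi / 8)) for i in range(9)]
geo = geodesic_distances(knn_graph(points, k=2))
straight = math.dist(points[0], points[8])  # Euclidean distance between the two ends
```

On this toy data the graph distance between the two end points stays close to the
arc length of the half circle, while their straight-line distance is only 2, which is
the sense in which the graph distances approximate geodesic distances.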

These methods are extended in the rich literature of graph embedding by considering
different characteristics of the constructed graphs (Fu and Ma, 2012). We can
find that traditional graph embedding mostly works on graphs constructed from
feature-represented data sets, where the proximity among nodes encoded by the edge
weights is well defined in the original feature space. In contrast, modern
graph embedding, which will be introduced in the following, mostly works on naturally
formed networks, such as social networks, biological networks, and e-commerce
networks. In those networks, the proximities among nodes are not explicitly or
directly defined. For example, an edge between two nodes usually just implies there is
a relationship between them, but cannot indicate the specific proximity. Also, even
if there is no edge between two nodes, we cannot say the proximity between these
two nodes is zero. The definition of node proximities depends on specific analytic
tasks and application scenarios. Therefore, modern graph embedding usually incorporates
rich information, such as network structures, properties, side information
and advanced information, to facilitate different problems and applications. Modern
graph embedding needs to target both of the goals mentioned before. In view of this,
traditional graph embedding can be regarded as a special case of modern graph
embedding, and the recent research progress on modern graph embedding pays more
attention to network inference.


2.3 Modern Graph Embedding 


To better support network inference, modern graph embedding considers much richer
information in a graph. According to the types of information that are preserved in
graph representation learning, the existing methods can be categorized into three
categories: (1) graph structures and properties preserving graph embedding, (2)
graph representation learning with side information and (3) advanced information
preserving graph representation learning. From a technical point of view, different
models are adopted to incorporate different types of information or address different
goals. The commonly used models include matrix factorization, random walk, deep
neural networks and their variations.


2.3.1 Structure-Property Preserving Graph Representation Learning


Among all the information encoded in a graph, graph structures and properties are 
two crucial factors that largely affect graph inference. Thus, one basic requirement 
of graph representation learning is to appropriately preserve graph structures and 
capture properties of graphs. Often, graph structures include first-order structures
and higher-order structures, such as second-order structures and community
structures. Graphs of different types have different properties. For example, directed
graphs have the asymmetric transitivity property, and the structural balance theory is
widely applicable to signed graphs.


2.3.1.1 Structure Preserving Graph Representation Learning 


Graph structures can be categorized into different groups that are present at
different granularities. The commonly exploited graph structures in graph
representation learning include neighborhood structure, high-order node proximity and
graph communities.

How to define the neighborhood structure in a graph is the first challenge. Based
on the discovery that the distribution of nodes appearing in short random walks is
similar to the distribution of words in natural language, DeepWalk (Perozzi et al,
2014) employs random walks to capture the neighborhood structure. Then, for
each walk sequence generated by random walks, DeepWalk follows Skip-Gram and
aims to maximize the probability of the neighbors of a node in the walk sequence.
Node2vec defines a flexible notion of a node’s graph neighborhood and designs
a second-order random walk strategy to sample the neighborhood nodes, which
can smoothly interpolate between breadth-first sampling (BFS) and depth-first
sampling (DFS). Besides the neighborhood structure, LINE (Tang et al, 2015b) is
proposed for large-scale network embedding and preserves the first- and second-order
proximities. The first-order proximity is the observed pairwise proximity between
two nodes. The second-order proximity is determined by the similarity of
the “contexts” (neighbors) of two nodes. Both are important in measuring the
relationships between two nodes. Essentially, LINE is based on a shallow model, and
consequently its representation ability is limited. SDNE (Wang et al, 2016) proposes
a deep model for network embedding, which also aims at capturing the first- and
second-order proximities. SDNE uses a deep auto-encoder architecture with
multiple non-linear layers to preserve the second-order proximity. To preserve the
first-order proximity, the idea of Laplacian eigenmaps (Belkin and Niyogi, 2002)
is adopted. Wang et al (2017g) propose a modularized nonnegative matrix factorization
(M-NMF) model for graph representation learning, which aims to preserve
both the microscopic structure, i.e., the first-order and second-order proximities of
nodes, and the mesoscopic community structure (Girvan and Newman, 2002). They
adopt the NMF model (Févotte and Idier, 2011) to preserve the microscopic structure.
Meanwhile, the community structure is detected by modularity maximization
(Newman, 2006a). Then, they introduce an auxiliary community representation matrix
to bridge the representations of nodes with the community structure. In this
way, the learned representations of nodes are constrained by both the microscopic
structure and the community structure.
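
The random-walk view of neighborhood structure used by DeepWalk can be made concrete
with a small sketch. It only generates uniform random walks and the (center, context)
pairs that a Skip-Gram model would be trained on; the training itself is left out.
The helper names random_walk and skipgram_pairs, the toy graph, and the walk length
and window size are assumptions chosen for illustration, not settings from the
original papers.

```python
import random

def random_walk(adj, start, length, rng):
    """Uniform random walk of at most `length` nodes starting at `start`."""
    walk = [start]
    while len(walk) < length:
        neighbors = adj[walk[-1]]
        if not neighbors:  # dead end: stop the walk early
            break
        walk.append(rng.choice(neighbors))
    return walk

def skipgram_pairs(walk, window):
    """(center, context) pairs for every node and its neighbors within the window."""
    pairs = []
    for i, center in enumerate(walk):
        lo, hi = max(0, i - window), min(len(walk), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, walk[j]))
    return pairs

# Toy undirected graph (hypothetical) as adjacency lists.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4], 4: [3]}
rng = random.Random(0)
walks = [random_walk(adj, node, length=6, rng=rng) for node in adj for _ in range(2)]
pairs = [p for w in walks for p in skipgram_pairs(w, window=2)]
```

Nodes that co-occur often in such pairs are pushed towards similar representations,
which is how the walk statistics stand in for the neighborhood structure.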

In summary, many network embedding methods aim to preserve the local structure
of a node, including neighborhood structure, high-order proximity as well as
community structure, in the latent low-dimensional space. Both linear and non-linear
models have been attempted, demonstrating the large potential of deep models in
network embedding.


2.3.1.2 Property Preserving Graph Representation Learning


Currently, most of the existing property preserving graph representation learning
methods focus on graph transitivity in all types of graphs and the structural balance
property in signed graphs.

Transitivity usually exists in a graph. At first glance, preserving such a property
is not challenging, because in a metric space the distance between different data
points naturally satisfies the triangle inequality. However, this is not always true
in the real world. Ou et al (2015) aim to preserve the non-transitivity property via
latent similarity components. The non-transitivity property declares that, for nodes
v1, v2 and v3 in a graph where (v1, v2) and (v2, v3) are similar pairs, (v1, v3) may
be a dissimilar pair. For example, in a social network, a student may connect with
his classmates and his family, while his classmates and family are probably very
different. The main idea is to learn multiple node embeddings, and then compare
different nodes based on multiple similarities, rather than one similarity. They
observe that if two nodes have a large semantic similarity, at least one of the
structure similarities is large; otherwise, all of the similarities are small. A
directed graph usually has the asymmetric transitivity property. Asymmetric
transitivity indicates that, if there is a directed edge from node i to node j and a
directed edge from j to v, there is likely a directed edge from i to v, but not from
v to i. In order to measure this high-order proximity, HOPE (Ou et al, 2016)
summarizes four measurements in a general formulation, and then utilizes a
generalized SVD problem to factorize the high-order proximity (Paige and Saunders,
1981), such that the time complexity of HOPE is largely reduced, which means HOPE is
scalable to large-scale networks. In a signed graph with both positive and negative
edges, social theories such as the structural balance theory (Cartwright and Harary,
1956; Cygan et al, 2012) apply, and they are very different from those for unsigned
graphs. The structural balance theory states that users in a signed social network
should be able to have their “friends” closer than their “foes”. To model the
structural balance phenomenon, SiNE (Wang et al, 2017f) utilizes a deep learning
model consisting of two deep networks with non-linear functions.
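
Asymmetric transitivity can be illustrated with a high-order proximity that counts
directed paths. The sketch below uses a truncated Katz index, taken here as one
example of such a proximity; the function name truncated_katz, the decay value beta
and the maximum path length are assumptions for illustration. Unlike HOPE, which
avoids materializing the proximity matrix by solving a generalized SVD problem, this
sketch builds the matrix explicitly and is only suitable for tiny graphs.

```python
def matmul(a, b):
    """Dense product of two square matrices given as lists of lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)] for i in range(n)]

def truncated_katz(adj, beta, max_len):
    """Sum of beta**l * A**l for l = 1..max_len, i.e. decayed counts of directed paths."""
    n = len(adj)
    total = [[0.0] * n for _ in range(n)]
    power = [[float(i == j) for j in range(n)] for i in range(n)]  # A**0
    for length in range(1, max_len + 1):
        power = matmul(power, adj)  # entry (i, j) counts directed paths of this length
        for i in range(n):
            for j in range(n):
                total[i][j] += (beta ** length) * power[i][j]
    return total

# Directed chain i -> j -> v, stored with indices 0 -> 1 -> 2.
adj = [[0, 1, 0],
       [0, 0, 1],
       [0, 0, 0]]
S = truncated_katz(adj, beta=0.5, max_len=3)
```

Here S[0][2] is positive although there is no direct edge from 0 to 2, while S[2][0]
is zero, so the proximity from i to v is preserved but not the reverse.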

The importance of maintaining network properties in the network embedding space,
especially the properties that largely affect the evolution and formation of networks,
has been well recognized. The key challenge is how to address the disparity and
heterogeneity of the original network space and the embedding vector space at the
property level. Generally, most of the structure and property preserving methods take
high-order proximities of nodes into account, which demonstrates the importance of
preserving high-order structures in network embedding. The difference is the strategy
of obtaining the high-order structures. Some methods implicitly preserve high-order
structure by assuming a generative mechanism from a node to its neighbors, while
other methods realize this by explicitly approximating high-order proximities
in the embedding space. As topology structures are the most notable characteristic
of networks, structure-preserving network embedding methods account for a large part
of the literature. Comparatively, property preserving network embedding is a
relatively new research topic and has only been studied lightly. As network properties
usually drive the formation and evolution of networks, it shows great potential for
future research and applications.


2.3.2 Graph Representation Learning with Side Information


Besides graph structures, side information is another important information source 
for graph representation learning. Side information in the context of graph represen- 
tation learning can be divided into two categories: node content and types of nodes 
and edges. Their difference is the way of integrating network structures and side 
information. 

Graph Representation Learning with Node Content. In some types of graphs,
like information networks, nodes are accompanied by rich information, such as
node labels, attributes or even semantic descriptions. How to combine them with
the network topology in graph representation learning has attracted considerable
research interest. Tu et al (2016) propose a semi-supervised graph embedding algorithm,
MMDW, by leveraging labeling information of nodes. MMDW is also based on the 
DeepWalk-derived matrix factorization. MMDW adopts support vector machines 
(SVM) (Hearst et al, 1998) and incorporates the label information to find an optimal 
classifying boundary. Yang et al (2015b) propose TADW that takes the rich informa- 
tion (e.g., text) associated with nodes into account when they learn the low dimen- 
sional representations of nodes. Pan et al (2016) propose a coupled deep model that 
incorporates graph structures, node attributes and node labels into graph embedding. 
Although different methods adopt different strategies to integrate node content and 
network topology, they all assume that node content provides additional proximity 
information to constrain the representations of nodes. 

Heterogeneous Graph Representation Learning. Different from graphs with node 
content, heterogeneous graphs consist of different types of nodes and links. How to 
unify the heterogeneous types of nodes and links in graph embedding is also an 
interesting and challenging problem. Jacob et al (2014) propose a heterogeneous 
social graph representation learning algorithm for classifying nodes. They learn the 
representations of all types of nodes in a common vector space, and perform the 
inference in this space. Chang et al (2015) propose a deep graph representation 
learning algorithm for heterogeneous graphs, whose nodes have various types (e.g.,
images and texts). The nonlinear embeddings of images and texts are learned by 
a CNN model and the fully connected layers, respectively. Huang and Mamoulis 
(2017) propose a meta path similarity preserving heterogeneous information graph 
representation learning algorithm. To model a particular relationship, a meta path 
(Sun et al, 2011) is a sequence of object types with edge types in between. 
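
As a small illustration of the meta path notion, the sketch below counts the path
instances that follow a given sequence of node types in a toy bibliographic graph.
It is a simplification made for this example: edge types are not modeled (each pair
of node types is assumed to be linked by a single relation), and the function name
metapath_counts and the toy data are hypothetical.

```python
def metapath_counts(adj, node_type, start, metapath):
    """Count path instances that start at `start` and follow the node types in `metapath`."""
    if node_type[start] != metapath[0]:
        raise ValueError("start node does not match the first type of the meta path")
    counts = {start: 1}
    for wanted in metapath[1:]:
        step = {}
        for node, c in counts.items():
            for nb in adj[node]:
                if node_type[nb] == wanted:  # only follow links to the required type
                    step[nb] = step.get(nb, 0) + c
        counts = step
    return counts

# Toy bibliographic graph: authors (A), papers (P), venues (V).
node_type = {"a1": "A", "a2": "A", "a3": "A", "p1": "P", "p2": "P", "v1": "V"}
links = [("a1", "p1"), ("a2", "p1"), ("a2", "p2"), ("a3", "p2"), ("p1", "v1"), ("p2", "v1")]
adj = {n: [] for n in node_type}
for u, v in links:
    adj[u].append(v)
    adj[v].append(u)

apa = metapath_counts(adj, node_type, "a1", ["A", "P", "A"])              # co-authorship
apvpa = metapath_counts(adj, node_type, "a1", ["A", "P", "V", "P", "A"])  # same venue
```

Different meta paths expose different relationships: a1 reaches only a2 (and itself)
through a shared paper, but also reaches a3 through the shared venue. Counts of this
kind are the raw material for meta path based similarity measures.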

In the methods preserving side information, side information introduces additional
proximity measures so that the relationships between nodes can be learned
more comprehensively. Their difference is the way of integrating network structures
and side information. Many of them are natural extensions of structure
preserving network embedding methods.


2.3.3 Advanced Information Preserving Graph Representation Learning


Different from side information, the advanced information refers to the supervised 
or pseudo supervised information in a specific task. The advanced information pre- 
serving network embedding usually consists of two parts. One is to preserve the 
network structure so as to learn the representations of nodes. The other is to estab- 
lish the connection between the representations of nodes and the target task. The 
combination of advanced information and network embedding techniques enables 
representation learning for networks. 

Information Diffusion. Information diffusion (Guille et al, 2013) is a ubiquitous
phenomenon on the web, especially in social networks. Bourigault et al (2014) propose
a graph representation learning algorithm for predicting information diffusion
in social networks. The goal of the proposed algorithm is to learn the representations
of nodes in the latent space such that the diffusion kernel can best explain the
cascades in the training set. The basic idea is to map the observed information
diffusion process into a heat diffusion process modeled by a diffusion kernel in the
continuous space. The kernel describes that the closer a node in the latent space is to
the source node, the sooner it is infected by information from the source node. The
cascade prediction problem here is defined as predicting the increment of cascade
size after a given time interval (Li et al, 2017a). Li et al (2017a) argue that the
previous work on cascade prediction all depends on a bag of hand-crafted features
to represent the cascade and graph structures. Instead, they present an end-to-end
deep learning model to solve this problem using the idea of graph embedding. The
whole procedure is able to learn the representation of a cascade graph in an end-to-end
manner.

Anomaly Detection. Anomaly detection has been widely investigated in previous 
work (Akoglu et al, 2015). Anomaly detection in graphs aims to infer structural
inconsistencies, that is, anomalous nodes that connect to diverse
influential communities (Hu et al, 2016; Burt, 2004). Hu et al (2016) propose a
graph embedding based method for anomaly detection. They assume that the com- 
munity memberships of two linked nodes should be similar. An anomaly node is 
one connecting to a set of different communities. Since the learned embedding of 
nodes captures the correlations between nodes and communities, based on the em- 
bedding, they propose a new measure to indicate the anomalousness level of a node. 
The larger the value of the measure, the higher the propensity for a node being an 
anomaly node. 
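
The intuition that an anomalous node connects to many different communities can be
sketched with a simple stand-in score. The sketch below is not the measure of Hu et al
(2016): it assumes hard community labels are already given (instead of being read off
a learned embedding) and uses the entropy of the community labels among a node's
neighbors as a hypothetical anomalousness score.

```python
import math
from collections import Counter

def community_entropy(adj, community, node):
    """Entropy of the community labels among a node's neighbors (0 when all agree)."""
    counts = Counter(community[nb] for nb in adj[node])
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())

# Toy graph: three small communities and one node "x" that bridges all of them.
community = {"a": 0, "b": 0, "c": 1, "d": 1, "e": 2, "f": 2, "x": 0}
adj = {
    "a": ["b", "x"], "b": ["a"],
    "c": ["d", "x"], "d": ["c"],
    "e": ["f", "x"], "f": ["e"],
    "x": ["a", "c", "e"],
}
scores = {node: community_entropy(adj, community, node) for node in adj}
most_anomalous = max(scores, key=scores.get)
```

As with the measure described above, a larger score indicates a higher propensity
for a node to be an anomaly; here the bridging node x receives the largest value.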

Graph Alignment. The goal of graph alignment is to establish the correspon- 
dence between the nodes from two graphs, i.e., to predict the anchor links across 
two graphs. The same users who are shared by different social networks naturally 
form the anchor links, and these links bridge the different graphs. The anchor link 
prediction problem is, given a source graph, a target graph and a set of observed
anchor links, to identify the hidden anchor links across the two graphs. Man et al 
(2016) propose a graph representation learning algorithm to solve this problem. The 
learned representations can preserve the graph structures and respect the observed
anchor links. 

Advanced information preserving graph embedding usually consists of two parts. 
One is to preserve the graph structures so as to learn the representations of nodes. 
The other is to establish the connection between the representations of nodes and the 
target task. The first one is similar to structure and property preserving network em- 
bedding, while the second one usually needs to consider the domain knowledge of a 
specific task. The domain knowledge encoded by the advanced information makes 
it possible to develop end-to-end solutions for network applications. Compared with
hand-crafted network features, such as numerous network centrality measures,
the combination of advanced information and network embedding techniques enables
representation learning for networks. Many network applications may benefit
from this new paradigm.


2.4 Graph Neural Networks 


Over the past decade, deep learning has become the “crown jewel” of artificial intel- 
ligence and machine learning, showing superior performance in acoustics, images 
and natural language processing, etc. Although it is well known that graphs are ubiq- 
uitous in the real world, it is very challenging to utilize deep learning methods to 
analyze graph data. This problem is non-trivial because of the following challenges: 
(1) Irregular structures of graphs. Unlike images, audio, and text, which have a clear 
grid structure, graphs have irregular structures, making it hard to generalize some 
of the basic mathematical operations to graphs. For example, defining convolution 
and pooling operations, which are the fundamental operations in convolutional neu- 
ral networks (CNNs), for graph data is not straightforward. (2) Heterogeneity and 
diversity of graphs. A graph itself can be complicated, containing diverse types and 
properties. These diverse types, properties, and tasks require different model archi- 
tectures to tackle specific problems. (3) Large-scale graphs. In the big-data era, real 
graphs can easily have millions or billions of nodes and edges. How to design scal- 
able models, preferably models that have a linear time complexity with respect to the 
graph size, is a key problem. (4) Incorporating interdisciplinary knowledge. Graphs 
are often connected to other disciplines, such as biology, chemistry, and social sci- 
ences. This interdisciplinary nature provides both opportunities and challenges: do- 
main knowledge can be leveraged to solve specific problems but integrating domain 
knowledge can complicate model designs. 

Graph neural networks have attracted considerable research attention
over the past several years. The adopted architectures and training strategies vary
greatly, ranging from supervised to unsupervised and from convolutional to recursive,
including graph recurrent neural networks (Graph RNNs), graph convolutional
networks (GCNs), graph autoencoders (GAEs), graph reinforcement learning
(Graph RL), and graph adversarial methods. Specifically, Graph RNNs
capture recursive and sequential patterns of graphs by modeling states at either the
node-level or the graph-level; GCNs define convolution and readout operations on
irregular graph structures to capture common local and global structural patterns;
GAEs assume low-rank graph structures and adopt unsupervised methods for node
representation learning; Graph RL defines graph-based actions and rewards to obtain
feedback on graph tasks while following constraints; graph adversarial methods
adopt adversarial training techniques to enhance the generalization ability of
graph-based models and test their robustness by adversarial attacks.

There are many ongoing or future research directions which are also worthy of 
further study, including new models for unstudied graph structures, compositional- 
ity of existing models, dynamic graphs, interpretability and robustness, etc. On the 
whole, deep learning on graphs is a promising and fast-developing research field 
that both offers exciting opportunities and presents many challenges. Studying deep 
learning on graphs constitutes a critical building block in modeling relational data, 
and it is an important step towards a future with better machine learning and artifi- 
cial intelligence techniques. 


2.5 Summary 


In this chapter, we introduce the motivation of graph representation learning. Then,
in Section 2.2, we discuss the traditional graph embedding methods, and the modern
graph embedding methods are introduced in Section 2.3. Basically, the structure
and property preserving graph representation learning is the foundation. If one cannot
preserve the graph structures well and retain the important graph properties in
the representation space, important information will be lost, which hurts the analytic
tasks that follow. Based on the structure and property preserving graph representation
learning, one may apply off-the-shelf machine learning methods. If some side
information is available, it can be incorporated into graph representation learning.
Furthermore, the domain knowledge of certain applications can be considered as
advanced information. As shown in Section 2.4, utilizing deep learning methods
on graphs is a promising and fast-developing research field that both offers exciting
opportunities and presents many challenges. Studying deep learning on graphs
constitutes a critical building block in modeling relational data, and it is an
important step towards a future with better machine learning and artificial
intelligence techniques.


3.1 Graph Neural Networks: An Introduction


Deep Learning has become one of the most dominant approaches in Artificial In- 
telligence research today. Conventional deep learning techniques, such as recurrent 
neural networks (Schuster and Paliwal, 1997) and convolutional neural networks 
(Krizhevsky et al, 2012) have achieved huge successes on Euclidean data such as 
images, or sequence data such as text and signals. However, in a rich variety of
scientific fields, many important real-world objects and problems can be naturally or
best expressed along with a complex structure, e.g., graph or manifold structure, such
as social networks, recommendation systems, drug discovery and program analysis.
On the one hand, these graph-structured data can encode complicated pairwise
relationships for learning more informative representations; on the other hand, the
structural and semantic information in original data (images or sequential texts)
can be exploited to incorporate domain-specific knowledge for capturing more
fine-grained relationships among the data.

In recent years, deep learning on graphs has experienced a burgeoning inter- 
est from the research community (Cui et al, 2018; Wu et al, 2019e; Zhang et al, 
2020e). Among them, Graph Neural Networks (GNNs) are the most successful learning
framework in coping with various tasks across a large number of application domains.
Newly proposed neural network architectures on graph-structured data (Kipf
and Welling, 2017a; Petar et al, 2018; Hamilton et al, 2017b) have achieved remarkable
performance in some well-known domains such as social networks and bioinformatics.
They have also infiltrated other fields of scientific research, including
recommendation systems (Wang et al, 2019j), computer vision (Yang et al, 2019g),
natural language processing (Chen et al, 2020o), program analysis (Allamanis et al,
2018b), software mining (LeClair et al, 2020), drug discovery (Ma et al, 2018),
anomaly detection (Markovitz et al, 2020), and urban intelligence (Yu et al, 2018a).

Despite these successes that existing research has achieved, GNNs still face many 
challenges when they are used to model highly-structured data that is time-evolving, 
multi-relational, and multi-modal. It is also very difficult to model mapping between 
graphs and other highly structured data, such as sequences, trees, and graphs. One 
challenge with graph-structured data is that it does not show as much spatial locality 
and structure as image or text data does. Thus, graph-structured data is not naturally 
suitable for highly regularized neural structures such as convolutional and recurrent 
neural networks. 

More importantly, new application domains for GNNs that emerge from real-world
problems introduce significant challenges for GNNs. Graphs provide a powerful
abstraction that can be used to encode arbitrary data types such as
multidimensional data. For example, similarity graphs, kernel matrices, and collaborative
filtering matrices can also be viewed as special cases of graph structures. Therefore, 
a successful modeling process of graphs is likely to subsume many applications that 
are often used in conjunction with specialized and hand-crafted methods. 

In this chapter, we will systematically organize the existing research of GNNs
along three axes: foundations of GNNs, frontiers of GNNs, and GNN-based applications.
First of all, we will introduce the fundamental aspects of GNNs ranging from
popular GNN methods and their expressive powers, to the scalability, interpretability,
and robustness of GNNs. Next, we will discuss various frontier research topics that
are built on GNNs, including graph classification, link prediction, graph generation
and transformation, graph matching, graph structure learning, dynamic GNNs,
heterogeneous GNNs, AutoML of GNNs and self-supervised GNNs. Based on them,
we further summarize the basic procedures which make full use of various GNNs
for a large number of applications. Finally, we provide the organization of our GNN
book and summarize the roadmap of the various research topics of GNNs.


3.2 Graph Neural Networks: Overview 


In this section, we summarize the development of graph neural networks along three 
important dimensions: (1) Foundations of GNNs; (2) Frontiers of GNNs; (3) GNN- 
based applications. We will first discuss the important research areas under the first 
two dimensions for GNNs and briefly illustrate the current progress and challenges 
for each research sub-domain. Then we will provide a general summary of
how to exploit the power of GNNs for a rich variety of applications. 


3.2.1 Graph Neural Networks: Foundations 


Conceptually, we can categorize the fundamental learning tasks of GNNs into five
different directions: i) Graph Neural Network Methods; ii) Theoretical understanding
of Graph Neural Networks; iii) Scalability of Graph Neural Networks; iv)
Interpretability of Graph Neural Networks; and v) Adversarial robustness of Graph
Neural Networks. We will discuss these fundamental aspects of GNNs one by one
in this subsection.

Graph Neural Network Methods. Graph Neural Networks are specifically designed
neural architectures that operate on graph-structured data. The goal of GNNs is
to iteratively update the node representations by aggregating the representations of
node neighbors and their own representation in the previous iteration. There are
a variety of graph neural networks proposed in the literature (Kipf and Welling,
2017a; Petar et al, 2018; Hamilton et al, 2017b; Gilmer et al, 2017; Xu et al, 2019d;
Velickovic et al, 2019; Kipf and Welling, 2016), which can be further categorized
into supervised GNNs and unsupervised GNNs. Once the node representations are
learnt, a fundamental task on graphs is node classification, which tries to classify the
nodes into a few predefined classes. Despite the huge successes that various GNNs
have achieved, a severe issue in training deep graph neural networks has been observed
to yield inferior results, namely the over-smoothing problem (Li et al, 2018b),
where all the nodes have similar representations. Many recent works have been proposed
with different remedies to overcome this over-smoothing issue.
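
Both the iterative neighborhood aggregation and the over-smoothing effect can be seen
in a deliberately stripped-down sketch. It keeps only the aggregation step (the mean
over a node and its neighbors) and omits the learnable weights and non-linearities of
real GNN layers; the names mean_aggregate and spread, the toy path graph and the
number of rounds are assumptions made for illustration.

```python
def mean_aggregate(adj, h):
    """One round: each node's new vector is the mean over itself and its neighbors."""
    new_h = {}
    for v, nbrs in adj.items():
        group = [v] + list(nbrs)
        new_h[v] = [sum(h[u][d] for u in group) / len(group) for d in range(len(h[v]))]
    return new_h

def spread(h):
    """Largest per-coordinate gap between any two node representations."""
    return max(max(col) - min(col) for col in zip(*h.values()))

# Path graph 0 - 1 - 2 - 3 with alternating two-dimensional input features.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
h = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 0.0], 3: [0.0, 1.0]}
spreads = [spread(h)]
for _ in range(40):  # 40 stacked aggregation rounds, i.e. a very deep model
    h = mean_aggregate(adj, h)
    spreads.append(spread(h))
```

The spread starts at 1.0 and shrinks towards zero as rounds are stacked, so the node
representations become nearly indistinguishable, which is the over-smoothing
behaviour described above in its simplest form.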


30 Lingfei Wu, Peng Cui, Jian Pei, Liang Zhao and Le Song 


Theoretical understanding of Graph Neural Networks. Rapid algorithmic developments
of GNNs have aroused a significant amount of interest in theoretical analysis
of the expressive power of GNNs. In particular, much effort has been made
to characterize the expressive power of GNNs when compared with
traditional graph algorithms (e.g., graph kernel-based methods) and to build more
powerful GNNs so as to overcome several limitations of GNNs. Specifically, Xu
et al (2019d) showed that current GNN methods can at best achieve the expressive
power of the 1-dimensional Weisfeiler-Lehman test (Weisfeiler and Leman, 1968),
a widely used method in the traditional graph kernel community (Shervashidze et al,
2011b). Much recent research has further proposed a series of design strategies
to reach beyond the expressive power of the Weisfeiler-Lehman test,
including attaching random attributes, attaching distance attributes, and utilizing
higher-order structures.
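
For readers unfamiliar with the 1-dimensional Weisfeiler-Lehman test, the following
sketch shows its color refinement loop and a classic pair of graphs it cannot tell
apart. The function name wl_histogram and the shared palette dictionary (used so that
colors are comparable across graphs) are choices made for this illustration.

```python
from collections import Counter

def wl_histogram(adj, rounds, palette):
    """1-WL color refinement; returns the histogram of final node colors.

    `palette` maps (own color, sorted neighbor colors) to a compact color id and
    must be shared between the graphs being compared.
    """
    colors = {v: 0 for v in adj}  # all nodes start with the same color
    for _ in range(rounds):
        new_colors = {}
        for v in adj:
            signature = (colors[v], tuple(sorted(colors[u] for u in adj[v])))
            new_colors[v] = palette.setdefault(signature, len(palette) + 1)
        colors = new_colors
    return Counter(colors.values())

palette = {}
six_cycle = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
two_triangles = {0: [1, 2], 1: [0, 2], 2: [0, 1], 3: [4, 5], 4: [3, 5], 5: [3, 4]}
path3 = {0: [1], 1: [0, 2], 2: [1]}
triangle = {0: [1, 2], 1: [0, 2], 2: [0, 1]}

same = wl_histogram(six_cycle, 3, palette) == wl_histogram(two_triangles, 3, palette)
differ = wl_histogram(path3, 3, palette) != wl_histogram(triangle, 3, palette)
```

The 6-cycle and the two disjoint triangles are not isomorphic, yet every node has
degree two in both, so the test assigns them identical color histograms. Strategies
such as random attributes, distance attributes or higher-order structures aim to
break exactly this kind of tie.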

Scalability of Graph Neural Networks. The increasing popularity of GNNs has
attracted many attempts to apply various GNN methods to real-world applications,
where graphs often have on the order of one hundred million nodes and one
billion edges. Unfortunately, most GNN methods cannot be directly applied
to such large-scale graph-structured data due to large memory requirements (Hu
et al, 2020b). Specifically, this is because the majority of GNNs are required to
store the whole adjacency matrices and the intermediate feature matrices in
memory, which poses significant challenges for both memory consumption
and computational costs. In order to address these issues, many recent works have
proposed various sampling strategies such as node-wise sampling (Hamilton
et al, 2017b; Chen et al, 2018d), layer-wise sampling (Chen and Bansal, 2018;
Huang et al, 2018), and graph-wise sampling (Chiang et al, 2019; Zeng et al, 2020a).
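
Node-wise sampling can be sketched as follows: instead of touching every neighbor,
each node keeps at most a fixed number of uniformly sampled neighbors per layer, so
the cost of computing a seed node's representation is bounded regardless of node
degrees. This is a minimal sketch in the spirit of fixed-size uniform neighbor
sampling; the function name sample_neighborhood and the fanouts parameter are
hypothetical, and it does not reproduce any specific published implementation.

```python
import random

def sample_neighborhood(adj, seeds, fanouts, rng):
    """Per layer, return sampled (target, source) edges needed for the seed nodes."""
    layers = []
    frontier = list(seeds)
    for fanout in fanouts:
        edges = []
        next_frontier = set(frontier)
        for v in frontier:
            nbrs = adj[v]
            # Keep every neighbor of low-degree nodes, otherwise sample without replacement.
            picked = nbrs if len(nbrs) <= fanout else rng.sample(nbrs, fanout)
            for u in picked:
                edges.append((v, u))
                next_frontier.add(u)
        layers.append(edges)
        frontier = sorted(next_frontier)
    return layers

# Star graph: hub 0 with 100 leaves, a worst case for full-neighborhood aggregation.
adj = {0: list(range(1, 101))}
adj.update({leaf: [0] for leaf in range(1, 101)})
layers = sample_neighborhood(adj, seeds=[0], fanouts=[5, 5], rng=random.Random(0))
```

With a fanout of 5 per layer, the two-layer computation for the hub touches 15 sampled
edges instead of all 100 incident edges, at the price of a noisier aggregate.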

Interpretability of Graph Neural Networks. Explainable artificial intelligence is
becoming increasingly popular in providing interpretable results on the machine
learning process, especially due to the black-box issue of deep learning techniques.
As a result, there is a surge of interest in improving the interpretability of GNNs.
Generally speaking, explanation results on GNNs could be important nodes, important
edges, or important features of nodes or edges. Technically, white-box approximation
based methods (Baldassarre and Azizpour, 2019; Sanchez-Lengeling et al,
2020) utilize the information inside the model, including gradients, intermediate
features, and model parameters, to provide the explanation. In contrast, the black-box
approximation based methods (Huang et al, 2020c; Zhang et al, 2020a; Vu and Thai,
2020) abandon the utilization of internal information of complex models but instead
leverage intrinsically interpretable simple models (e.g., linear regression and
decision trees) to fit the complex models. However, most of the existing works are
time-consuming, which makes it difficult to cope with large-scale graphs.
To this end, many recent efforts have been made to develop more efficient
approaches without compromising the explanation accuracy.

Adversarial robustness of Graph Neural Networks. Trustworthy machine learning
has recently attracted a significant amount of attention, since existing studies
have shown that deep learning models can be deliberately fooled, evaded, misled,
and stolen (Goodfellow et al, 2015). Consequently, a line of research has extensively
studied the robustness of models in domains like computer vision and natural
language processing, which has also influenced similar research on the robustness
of GNNs. Technically, the standard approach (via adversarial examples) for studying
the robustness of GNNs is to construct a small change to the input graph data
and then observe whether it leads to a large change in the prediction results (e.g.
node classification accuracy). There is a growing number of research works on either
adversarial attacks (Dai et al, 2018a; Wang and Gong, 2019; Wu et al, 2019b;
Zügner et al, 2018; Zügner et al, 2020) or adversarial training (Xu et al, 2019c; Feng
et al, 2019b; Chen et al, 2020l; Jin and Zhang, 2019). Many recent efforts provide
both theoretical guarantees and new algorithmic developments in adversarial
training and certified robustness.


3.2.2 Graph Neural Networks: Frontiers 


Built on the fundamental techniques of GNNs described above, research on a variety
of graph-related problems is developing rapidly. In this section, we introduce these
research frontiers, which are either long-standing graph learning problems with new
GNN solutions or recently emerging learning problems with GNNs.

Graph Neural Networks: Graph Classification and Link Prediction. Since each
layer in a GNN model only produces node-level representations, graph pooling
layers are needed to compute a graph-level representation from the node-level
representations. The graph-level representation, which summarizes the key
characteristics of the input graph structure, is the critical component for graph
classification. Depending on the learning technique of the graph pooling layer, these
methods can generally be categorized into four groups: simple flat pooling (Duvenaud
et al, 2015a; Mesquita et al, 2020), attention-based pooling (Lee et al, 2019d; Huang
et al, 2019d), cluster-based pooling (Ying et al, 2018c), and other types of pooling
(Zhang et al, 2018f; Bianchi et al, 2020; Morris et al, 2020b). Besides graph
classification, another long-standing graph learning problem is link prediction,
which aims to predict missing or future links between any pair of nodes. Since GNNs
can jointly learn from both graph structure and side information (e.g. node and edge
features), they have shown great advantages over conventional graph learning methods
for link prediction. Regarding the learning types of link prediction, node-based
methods (Kipf and Welling, 2016) and subgraph-based methods (Zhang and Chen, 2018a,
2020) are two popular groups of GNN-based methods.
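
As an illustration of the simplest of these groups, flat pooling, the sketch below
collapses a set of node-level vectors into a single graph-level vector with a
permutation-invariant reduction. It is a minimal example under assumptions: the
function name `readout`, the three reduction modes, and the toy embeddings are
hypothetical and are not taken from any of the cited methods.

```python
def readout(node_embeddings, mode="mean"):
    """Collapse a list of node vectors into one graph-level vector."""
    n = len(node_embeddings)
    columns = list(zip(*node_embeddings))  # one tuple per feature dimension
    if mode == "sum":
        return [sum(col) for col in columns]
    if mode == "mean":
        return [sum(col) / n for col in columns]
    if mode == "max":
        return [max(col) for col in columns]
    raise ValueError(f"unknown pooling mode: {mode}")


# Three nodes with two-dimensional embeddings (hypothetical values).
H = [[1.0, 0.0], [3.0, 2.0], [2.0, 4.0]]
graph_sum = readout(H, "sum")    # [6.0, 6.0]
graph_mean = readout(H, "mean")  # [2.0, 2.0]
graph_max = readout(H, "max")    # [3.0, 4.0]
```

Because each reduction ignores the order of the nodes, relabeling the nodes of the
graph leaves the graph-level vector unchanged, which is the property a pooling
layer needs.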

Graph Neural Networks: Graph Generation and Graph Transformation. The graph
generation problem, which builds probabilistic models over graphs, is a classical
research problem that lies at the intersection of probability theory and graph
theory. Recent years have seen increasing interest in developing deep graph
generative models that are built on modern deep learning techniques for graphs,
such as GNNs. These deep models have proven to be more successful at capturing the
complex dependencies within graph data and at generating more realistic graphs.
Encouraged by the great successes of the Variational AutoEncoder (VAE) (Kingma and
Welling, 2013) and Generative Adversarial Networks (GAN) (Goodfellow et al, 2014a,b),
there are three representative GNN-based learning paradigms for graph generation:
GraphVAE approaches (Jin et al, 2018b; Simonovsky and Komodakis, 2018; Grover et al,
2019), GraphGAN approaches (De Cao and Kipf, 2018; You et al, 2018a), and deep
autoregressive methods (Li et al, 2018d; You et al, 2018b; Liao et al, 2019a). The
graph transformation problem can be formulated as a conditional graph generation
problem, whose goal is to learn a translation mapping between the input source graph
and the output target graph (Guo et al, 2018b). Such learning problems often arise
in other domains, for example machine translation in natural language processing
and image style transfer in computer vision. Depending on what graph information is
transformed, this problem can generally be grouped into four categories: node-level
transformation (Battaglia et al, 2016; Yu et al, 2018a; Li et al, 2018e), edge-level
transformation (Guo et al, 2018b; Zhu et al, 2017; Do et al, 2019), node-edge
co-transformation (Maziarka et al, 2020a; Kaluza et al, 2018; Guo et al, 2019c), and
graph-involved transformation (Bastings et al, 2017; Xu et al, 2018c; Li et al,
2020f).

Graph Neural Networks: Graph Matching and Graph Structure Learning. The
problem of graph matching is to find the correspondence between two input graphs,
and it has been studied extensively in a variety of research fields. The graph
matching problem is known to be NP-hard in general (Loiola et al, 2007), which makes
exact, optimal solutions computationally infeasible for real-world large-scale
problems. Due to the expressive power of GNNs, there is increasing attention on
developing GNN-based graph matching methods that improve matching accuracy and
efficiency (Zanfir and Sminchisescu, 2018; Rolinek et al, 2020; Li et al, 2019h;
Ling et al, 2020). Graph matching aims to measure the similarity between two graph
structures without changing them. In contrast, graph structure learning aims to
produce an optimized graph structure by jointly learning the implicit graph
structure and the graph node representations (Chen et al, 2020m; Franceschi et al,
2019; Velickovic et al, 2020). The learned graph structure can often be treated as a
shift from the intrinsic graph, which is often noisy or incomplete. Graph structure
learning can also be used when no initial graph is provided but the data matrix
shows correlation among data points.

Dynamic Graph Neural Networks and Heterogeneous Graph Neural Networks.
In real-world applications, the graph nodes (entities) and the graph edges (relations)
often evolve over time, which naturally gives rise to dynamic graphs. Unfortunately,
many GNNs cannot be directly applied to dynamic graphs, where modeling the evolution
of the graph is critical for making accurate predictions. A simple yet often
effective approach is to convert dynamic graphs into static graphs, at the cost of
a potential loss of information. Regarding the type of dynamic graph, there are two
major categories of GNN-based methods: GNNs for discrete-time dynamic graphs (Seo
et al, 2018; Manessi et al, 2020) and GNNs for continuous-time dynamic graphs
(Kazemi et al, 2019; Xu et al, 2020a). Independently, another popular graph type in
real applications is the heterogeneous graph, which consists of different types of
graph nodes and edges. GNNs designed for homogeneous graphs cannot fully exploit
this type information. As a result, a new line of research has been devoted to
developing heterogeneous graph neural networks, including message passing based
methods (Wang et al, 2019l; Fu et al, 2020; Hong et al, 2020b), encoder-decoder
based methods (Tu et al, 2018; Zhang et al, 2019b), and adversarial based methods
(Wang et al, 2018a; Hu et al, 2018a).

Graph Neural Networks: AutoML and Self-supervised Learning. Automated machine
learning (AutoML) has recently drawn a significant amount of attention in both
the research and industrial communities. Its goal is to cope with the huge challenge
of the time-consuming manual tuning process, especially for complicated deep
learning models. This wave of AutoML research also influences research efforts on
automatically identifying an optimized GNN model architecture and training
hyperparameters. Most of the existing research focuses on either the architecture
search space (Gao et al, 2020b; Zhou et al, 2019a) or the training hyperparameter
search space (You et al, 2020a; Shi et al, 2020). Another important research
direction for GNNs is to address a limitation of most deep learning models: they
require large amounts of annotated data. As a result, self-supervised learning has
been proposed, which aims to design and leverage domain-specific pretext tasks on
unlabeled data to pretrain a GNN model. To study the power of self-supervised
learning in GNNs, quite a few works systematically design and compare different
self-supervised pretext tasks in GNNs (Hu et al, 2020c; Jin et al, 2020d; You et al,
2020c).


3.2.3 Graph Neural Networks: Applications 


Due to the power of GNNs to model various data with complex structures, GNNs
have been widely applied to many applications and domains, such as modern
recommender systems, computer vision (CV), natural language processing (NLP),
program analysis, software mining, bioinformatics, anomaly detection, and urban
intelligence. Though GNNs are utilized to solve different tasks for different
applications, they all consist of two important steps, namely graph construction and
graph representation learning. Graph construction aims to first transform or
represent the input data as graph-structured data. Based on the graphs, graph
representation learning utilizes GNNs to learn the node or graph embeddings for the
downstream tasks. In the following, we briefly introduce the techniques of these two
steps for different applications.


3.2.3.1 Graph Construction


Graph construction is important in capturing the dependency among the objects in 
the input data. Given the various formats of input data, different applications have 
different graph construction techniques, while some tasks need to pre-define the 
semantic meaning of nodes and edges to fully express the structural information of 
the input data. 

Input Data with Explicit Graph Structures. Some applications naturally carry
graph structure in their data, so the nodes and the edges (relationships) among
them do not need to be defined by hand. For example, the user-item interactions in
a recommender system naturally form a graph in which a user-item preference is
regarded as an edge between a user node and an item node. In the task of drug
design, a molecule is also naturally represented as a graph, where each node denotes
an atom and each edge denotes a bond that connects two atoms. In the task of protein
function prediction and interaction, a protein can also be readily represented as a
graph, where each node refers to an amino acid and each edge refers to an
interaction between amino acids.

Some graphs are constructed with node and edge attributes. For example, for
transportation problems in urban intelligence, a traffic network can be formalized
as an undirected graph to predict the traffic state. Specifically, the nodes are
the traffic sensing locations (e.g., sensor stations or road segments), and the
edges are the intersections or road segments connecting those traffic sensing
locations. An urban traffic network can also be modeled as a directed graph with
attributes to predict the traffic speed, where the nodes are the road segments and
the edges are the intersections. Road segment width, length, and direction are the
attributes of the nodes, and the type of intersection and the presence of traffic
lights or toll gates are the attributes of the edges.

Input Data with Implicit Graph Structures. For many tasks that do not naturally
involve structured data, graph construction becomes very challenging. It is
important to choose a representation in which the nodes and edges capture all the
important information. For example, in computer vision (CV) tasks, there are three
kinds of graph construction. The first is to split the image or the video frame into
regular grids, where each grid cell serves as a vertex of the visual graph. The
second is to first obtain preprocessed structures that can be directly borrowed for
the vertex representation, such as scene graphs. The last is to use semantic
information to represent visual vertices, such as assigning pixels with similar
features to the same vertex. The edges in the visual graph can capture two kinds of
information. One is spatial information. For example, in static methods that
generate scene graphs (Xu et al, 2017a) and human skeletons (Jain et al, 2016a), it
is natural to choose edges between nodes in the visual graph that represent their
spatial connections. The other is temporal information. For example, to represent a
video, the model not only builds spatial relations within a frame but also captures
temporal connections among adjacent frames.
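
The first construction above, splitting an image into regular grid cells, can be
sketched as follows. This is only an illustrative example under assumptions: the
helper names `grid_graph` and `cell_features`, the 4-connected neighborhood, and
the use of mean intensity as the node feature are hypothetical choices, not a
construction prescribed by the cited works.

```python
def grid_graph(height, width):
    """Return edges of a 4-connected grid; node id = row * width + col."""
    edges = []
    for r in range(height):
        for c in range(width):
            v = r * width + c
            if c + 1 < width:
                edges.append((v, v + 1))      # right neighbor
            if r + 1 < height:
                edges.append((v, v + width))  # bottom neighbor
    return edges


def cell_features(image, cell):
    """Mean intensity of each cell-by-cell patch, used as a node feature."""
    rows, cols = len(image) // cell, len(image[0]) // cell
    feats = []
    for r in range(rows):
        for c in range(cols):
            patch = [image[r * cell + i][c * cell + j]
                     for i in range(cell) for j in range(cell)]
            feats.append(sum(patch) / len(patch))
    return rows, cols, feats


# A 4x4 grayscale image split into 2x2 cells (hypothetical pixel values).
image = [
    [0, 0, 8, 8],
    [0, 0, 8, 8],
    [4, 4, 2, 2],
    [4, 4, 2, 2],
]
rows, cols, feats = cell_features(image, cell=2)
edges = grid_graph(rows, cols)
```

Here the four cells become four vertices with features `[0.0, 8.0, 4.0, 2.0]`, and
the edges encode spatial adjacency between neighboring cells.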

In natural language processing (NLP) tasks, graph construction from text data
falls into five categories: text graphs, syntactic graphs, semantic graphs,
knowledge graphs, and hybrid graphs. Text graphs normally regard words, sentences,
paragraphs, or documents as nodes and establish edges by word co-occurrence,
location, or text similarity. Syntactic graphs (or trees) emphasize the syntactic
dependencies between words in a sentence, such as the dependency graph and the
constituency graph. Knowledge graphs (KGs) are graphs of data intended to
accumulate and convey knowledge of the real world. Hybrid graphs contain multiple
types of nodes and edges to integrate heterogeneous information. In the task of
program analysis, graph representations of programs include syntax trees, control
flow, data flow, program dependence, and call graphs, each providing a different
view of a program. At a high level, a program can be thought of as a set of
heterogeneous entities that are related through various kinds of relations. This
view directly maps a program to a heterogeneous directed graph, with each entity
represented as a node and each relationship represented as an edge of the
corresponding type.


3.2.3.2 Graph Representation Learning 


After obtaining the graph representation of the input data, the next step is to
apply GNNs to learn the graph representations. Some works directly utilize typical
GNNs, such as GCN (Kipf and Welling, 2017a), GAT (Veličković et al, 2018), GGNN
(Li et al, 2016a), and GraphSage (Hamilton et al, 2017b), which generalize to
different application tasks. Some special tasks, however, need an additional design
of the GNN architecture to better handle the specific problem. For example, for
recommender systems, PinSage (Ying et al, 2018a) takes the top-k counted nodes of a
node as its receptive field and aggregates them with weights. PinSage scales to
web-scale recommender systems with millions of users and items. KGCN (Wang et al,
2019d) aims to enhance the item representation by performing aggregations over the
item's corresponding entity neighborhood in a knowledge graph. KGAT (Wang et al,
2019j) shares a generally similar idea with KGCN, except that it incorporates an
auxiliary loss for knowledge graph reconstruction. As another example, for the NLP
task of KB alignment, Xu et al (2019e) formulated it as a graph matching problem and
proposed a graph attention-based approach. It first matches all entities in two KGs,
and then jointly models the local matching information to derive a graph-level
matching vector. The detailed GNN techniques for each application can be found in
the following chapters of this book.


3.2.4 Graph Neural Networks: Organization 


The high-level organization of the book is shown in Figure 3.1. The book is
organized into four parts to best accommodate a variety of readers. Part I introduces
basic concepts; Part II discusses the most established methods; Part III presents the
most typical frontiers; and Part IV describes advances in methods and applications
that tend to be important and promising for future research. Next, we briefly
elaborate on each part.


• Part I: Introduction. These chapters provide a general introduction, from
  representation learning for different data types to graph representation
  learning. In addition, they introduce the basic ideas and typical variants of
  graph neural networks for graph representation learning.

• Part II: Foundations. These chapters describe the foundations of graph neural
  networks by introducing the properties of graph neural networks as well as
  several fundamental problems in this line. Specifically, this part introduces the
  fundamental problems in graphs: node classification, the expressive power of
  graph neural networks, the interpretability and scalability issues of graph
  neural networks, and the adversarial robustness of graph neural networks.

• Part III: Frontiers. These chapters present frontier or advanced problems in
  the domain of graph neural networks. Specifically, they introduce the techniques
  for graph classification, link prediction, graph generation, graph
  transformation, graph matching, and graph structure learning. In addition, they
  introduce several variants of GNNs for different types of graphs, such as GNNs
  for dynamic graphs and heterogeneous graphs. We also introduce AutoML and
  self-supervised learning for GNNs.

• Part IV: Broad and Emerging Applications. These chapters introduce the broad
  and emerging applications of GNNs. Specifically, these GNN-based applications
  cover modern recommender systems, tasks in computer vision and NLP, program
  analysis, software mining, biomedical knowledge graph mining for drug design,
  protein function prediction and interaction, anomaly detection, and urban
  intelligence.


3.3 Summary 


Graph Neural Networks (GNNs) have been emerging rapidly to deal with the graph- 
structured data, which cannot be directly modeled by the conventional deep learning 
techniques that are designed for Euclidean data such as images and text. A wide 
range of applications can be naturally or best represented with graph structure and 
have been successfully handled by various graph neural networks. 

In this chapter, we have systematically introduced the development and overview
of GNNs, including their foundations, frontiers, and applications. Specifically, we
covered the fundamental aspects of GNNs, ranging from the existing typical GNN
methods and their expressive power to the scalability, interpretability, and
robustness of GNNs. These aspects motivate research on better understanding and
utilization of GNNs. Built on GNNs, recent research has seen a surge of interest in
graph-related research problems, which we call the frontiers of GNNs. We have
discussed various frontier research built on GNNs, ranging from graph
classification and link prediction to graph generation, transformation, matching,
and graph structure learning. Due to the power of GNNs to model various data with
complex structures, GNNs have been widely applied to many applications and domains,
such as modern recommender systems, computer vision, natural language processing,
program analysis, software mining, bioinformatics, anomaly detection, and urban
intelligence. Most of these tasks consist of two important steps, namely graph
construction and graph representation learning. Thus, we introduced the techniques
of these two steps for different applications. Finally, a summary of the
organization of this book was provided at the end of this chapter.

Fig. 3.1: The high-level organization of the book


Part II 
Foundations of Graph Neural Networks 


Chapter 4
Graph Neural Networks for Node Classification


Jian Tang and Renjie Liao 


Abstract Graph Neural Networks are neural architectures specifically designed for
graph-structured data, which have received increasing attention recently and have
been applied to different domains and applications. In this chapter, we focus on a
fundamental task on graphs: node classification. We give a detailed definition of
node classification and introduce some classical approaches such as label
propagation. Afterwards, we introduce a few representative architectures of graph
neural networks for node classification. We further point out the main difficulty
of training deep graph neural networks, the oversmoothing problem, and present some
of the latest advances in this direction, such as continuous graph neural networks.


4.1 Background and Problem Definition 


Graph-structured data (e.g., social networks, the World Wide Web, and
protein-protein interaction networks) are ubiquitous in the real world and cover a
variety of applications. A fundamental task on graphs is node classification, which
tries to classify the nodes into a few predefined categories. For example, in social
networks, we want to predict the political bias of each user; in protein-protein
interaction networks, we are interested in predicting the functional role of each
protein; in the World Wide Web, we may have to classify web pages into different
semantic categories. To make effective predictions, it is critical to have
effective node representations, which largely determine the performance of node
classification.
Graph neural networks are neural network architectures specifically designed for
learning representations of graph-structured data, including learning node
representations of big graphs (e.g., social networks and the World Wide Web) and
learning representations of entire graphs (e.g., molecular graphs). In this chapter,
we focus on learning node representations for large-scale graphs; learning
whole-graph representations is introduced in other chapters. A variety of graph
neural networks have been proposed (Kipf and Welling, 2017b; Veličković et al, 2018;
Gilmer et al, 2017; Xhonneux et al, 2020; Liao et al, 2019b; Kipf and Welling,
2016; Veličković et al, 2019). In this chapter, we comprehensively revisit existing
graph neural networks for node classification, including supervised approaches
(Sec. 4.2), unsupervised approaches (Sec. 4.3), and a common problem of graph
neural networks for node classification, over-smoothing (Sec. 4.4).


Problem Definition. Let us first formally define the problem of learning node
representations for node classification with graph neural networks. Let
$\mathcal{G} = (\mathcal{V}, \mathcal{E})$ denote a graph, where $\mathcal{V}$ is
the set of nodes and $\mathcal{E}$ is the set of edges. $A \in \mathbb{R}^{N \times N}$
represents the adjacency matrix, where $N$ is the total number of nodes, and
$X \in \mathbb{R}^{N \times C}$ represents the node attribute matrix, where $C$ is
the number of features for each node. The goal of graph neural networks is to learn
effective node representations (denoted as $H \in \mathbb{R}^{N \times F}$, where
$F$ is the dimension of the node representations) by combining the graph structure
information and the node attributes, which are further used for node classification.


Table 4.1: Notations used throughout this chapter. 


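Concept                                  Notation
Graph                                    $\mathcal{G} = (\mathcal{V}, \mathcal{E})$
Adjacency matrix                         $A \in \mathbb{R}^{N \times N}$
Node attributes                          $X \in \mathbb{R}^{N \times C}$
Total number of GNN layers               $K$
Node representations at the k-th layer   $H^k \in \mathbb{R}^{N \times F}$, $k \in \{1, 2, \cdots, K\}$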


4.2 Supervised Graph Neural Networks 


In this section, we revisit several representative methods of graph neural networks 
for node classification. We will focus on the supervised methods and introduce the 
unsupervised methods in the next section. We will start by introducing a general 
framework of graph neural networks and then introduce different variants under this 
framework.


4.2.1 General Framework of Graph Neural Networks


The essential idea of graph neural networks is to iteratively update the node
representations by combining the representations of their neighbors with their own
representations. In this section, we introduce the general framework of graph
neural networks from (Xu et al, 2019d). Starting from the initial node
representation $H^0 = X$, each layer has two important functions:


• AGGREGATE, which tries to aggregate the information from the neighbors of
  each node;

• COMBINE, which tries to update the node representations by combining the
  aggregated information from neighbors with the current node representations.


Mathematically, we can define the general framework of graph neural networks
as follows:

Initialization: $H^0 = X$
For $k = 1, 2, \cdots, K$,

$$a_v^k = \mathrm{AGGREGATE}^k\{H_u^{k-1} : u \in N(v)\} \tag{4.1}$$
$$H_v^k = \mathrm{COMBINE}^k\{H_v^{k-1}, a_v^k\}, \tag{4.2}$$

where $N(v)$ is the set of neighbors of the $v$-th node. The node representations
$H^K$ in the last layer can be treated as the final node representations.
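
The loop defined by Eq. 4.1 and Eq. 4.2 can be sketched in plain Python as follows.
This is a minimal illustration under stated assumptions rather than the formulation
of any particular model: the mean aggregator, the averaging combiner, and the names
`gnn_forward`, `mean_aggregate`, and `average_combine` are hypothetical choices, and
practical GNNs use learnable functions in both steps.

```python
def gnn_forward(X, neighbors, num_layers, aggregate, combine):
    """Run K rounds of Eq. 4.1 and Eq. 4.2; return the final node representations."""
    H = [list(x) for x in X]  # initialization: H^0 = X
    for _ in range(num_layers):
        H_next = []
        for v in range(len(H)):
            a_v = aggregate([H[u] for u in neighbors[v]])  # Eq. 4.1
            H_next.append(combine(H[v], a_v))               # Eq. 4.2
        H = H_next  # all nodes are updated synchronously
    return H


def mean_aggregate(vectors):
    """Element-wise mean of the neighbor vectors (assumes at least one neighbor)."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def average_combine(h_v, a_v):
    """Average a node's own vector with the aggregated neighbor message."""
    return [0.5 * (h + a) for h, a in zip(h_v, a_v)]


# Path graph 0 - 1 - 2 with one scalar feature per node (hypothetical data).
neighbors = {0: [1], 1: [0, 2], 2: [1]}
X = [[0.0], [4.0], [8.0]]
H1 = gnn_forward(X, neighbors, 1, mean_aggregate, average_combine)  # [[2.0], [4.0], [6.0]]
H2 = gnn_forward(X, neighbors, 2, mean_aggregate, average_combine)  # [[3.0], [4.0], [5.0]]
```

After two layers, the representation of node 0 already depends on node 2, which is
two hops away; each additional layer enlarges the receptive field by one hop.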

Once we have the node representations, they can be used for downstream tasks.
Taking node classification as an example, the label of node $v$ (denoted as
$\hat{y}_v$) can be predicted through a Softmax function, i.e.,

$$\hat{y}_v = \mathrm{Softmax}(W H_v^\top), \tag{4.3}$$

where $W \in \mathbb{R}^{|\mathcal{L}| \times F}$ and $|\mathcal{L}|$ is the number
of labels in the output space.
Given a set of labeled nodes, the whole model can be trained by minimizing the
following loss function:

$$O = \frac{1}{n_l} \sum_{i=1}^{n_l} \mathrm{loss}(\hat{y}_i, y_i), \tag{4.4}$$

where $y_i$ is the ground truth label of node $i$, $n_l$ is the number of labeled
nodes, and $\mathrm{loss}(\cdot, \cdot)$ is a loss function such as the
cross-entropy loss. The whole neural network can be optimized by minimizing the
objective function $O$ with backpropagation.
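
As a small worked example of Eq. 4.3 and Eq. 4.4, the sketch below computes the
Softmax prediction for each labeled node and averages the cross-entropy loss. It is
an illustration only: the helper names `softmax`, `predict`, and `objective`, as
well as the values of `H`, `W`, and `labels`, are hypothetical, and no training
step is shown.

```python
import math


def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def predict(W, h_v):
    """Eq. 4.3: class probabilities for one node from its representation h_v."""
    logits = [sum(w * h for w, h in zip(row, h_v)) for row in W]
    return softmax(logits)


def objective(W, H, labels):
    """Eq. 4.4: mean cross-entropy over labeled nodes; labels maps node -> class."""
    total = 0.0
    for i, y in labels.items():
        total += -math.log(predict(W, H[i])[y])
    return total / len(labels)


# Hypothetical values: 3 nodes, F = 2, two classes, only nodes 0 and 2 are labeled.
H = [[2.0, 0.0], [1.0, 1.0], [0.0, 2.0]]
W = [[1.0, 0.0], [0.0, 1.0]]
labels = {0: 0, 2: 1}
loss = objective(W, H, labels)
```

Only the labeled nodes contribute to the objective; node 1 still influences
training indirectly, because its features flow into its neighbors' representations
through the GNN layers.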

Above, we presented a general framework of graph neural networks. Next, we will
introduce a few of the most representative instantiations or variants of graph
neural networks in the literature.


4.2.2 Graph Convolutional Networks


We start with graph convolutional networks (GCN) (Kipf and Welling, 2017b), which
are now the most popular graph neural network architecture due to their simplicity
and effectiveness in a variety of tasks and applications. Specifically, the node
representations in each layer are updated according to the following propagation
rule:

$$H^{k+1} = \sigma\left(\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} H^{k} W^{k}\right). \tag{4.5}$$

$\tilde{A} = A + I$ is the adjacency matrix of the given undirected graph
$\mathcal{G}$ with self-connections, which allows each node's own features to be
incorporated when updating its representation. $I \in \mathbb{R}^{N \times N}$ is
the identity matrix. $\tilde{D}$ is a diagonal matrix with
$\tilde{D}_{ii} = \sum_j \tilde{A}_{ij}$. $\sigma(\cdot)$ is an activation function
such as ReLU or Tanh. The ReLU activation function is widely used and is defined as
$\mathrm{ReLU}(x) = \max(0, x)$. $W^k \in \mathbb{R}^{F \times F'}$ ($F$ and $F'$
are the dimensions of the node representations in the $k$-th and $(k+1)$-th layers,
respectively) is a layer-wise linear transformation matrix, which is trained during
optimization.
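
To make the propagation rule in Eq. 4.5 concrete, here is a minimal dense sketch in
plain Python. It is written under simplifying assumptions: the helper names
`normalized_adjacency`, `matmul`, `relu`, and `gcn_layer` are hypothetical, the
weights are fixed rather than trained, and dense lists are used for readability
where a practical implementation would use sparse matrices.

```python
import math


def normalized_adjacency(A):
    """Return D~^{-1/2} (A + I) D~^{-1/2} for a dense adjacency matrix A."""
    n = len(A)
    A_tilde = [[A[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
               for i in range(n)]
    deg = [sum(row) for row in A_tilde]  # D~_ii = sum_j A~_ij
    return [[A_tilde[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]


def matmul(P, Q):
    """Dense matrix product of two lists of rows."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*Q)] for row in P]


def relu(M):
    """Element-wise ReLU(x) = max(0, x)."""
    return [[max(0.0, v) for v in row] for row in M]


def gcn_layer(A_hat, H, W):
    """One propagation step of Eq. 4.5: ReLU(A_hat H W)."""
    return relu(matmul(matmul(A_hat, H), W))


# Path graph 0 - 1 - 2; features and weights are hypothetical.
A = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W0 = [[1.0, -1.0], [0.5, 0.5]]
A_hat = normalized_adjacency(A)
H1 = gcn_layer(A_hat, X, W0)
```

Because `A_hat` depends only on the graph, it can be computed once and reused by
every layer; only the weight matrices differ between layers.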

We can further dissect Eq. 4.5 to understand the AGGREGATE and COMBINE functions
defined in GCN. For a node $i$, the node updating equation can be reformulated as
below:

$$H_i^{k+1} = \sigma\Big(\sum_{j \in N(i) \cup \{i\}} \frac{\tilde{A}_{ij}}{\sqrt{\tilde{D}_{ii} \tilde{D}_{jj}}} H_j^{k} W^{k}\Big) \tag{4.6}$$

$$H_i^{k+1} = \sigma\Big(\sum_{j \in N(i)} \frac{A_{ij}}{\sqrt{\tilde{D}_{ii} \tilde{D}_{jj}}} H_j^{k} W^{k} + \frac{1}{\tilde{D}_{ii}} H_i^{k} W^{k}\Big) \tag{4.7}$$

In Eq. 4.7, we can see that the AGGREGATE function is defined as the weighted sum
of the neighbor node representations. The weight of the neighbor $j$ is determined
by the weight of the edge between $i$ and $j$ (i.e. $A_{ij}$ normalized by the
degrees of the two nodes). The COMBINE function is defined as the summation of the
aggregated messages and the node representation itself, in which the node
representation is normalized by its own degree.


Connections with Spectral Graph Convolutions. Next, we discuss the connections
between GCNs and traditional spectral filters defined on graphs (Defferrard
et al, 2016). The spectral convolution on graphs can be defined as the
multiplication of a node-wise signal $x \in \mathbb{R}^N$ with a convolutional
filter $g_\theta = \mathrm{diag}(\theta)$ ($\theta \in \mathbb{R}^N$ is the
parameter of the filter) in the Fourier domain. Mathematically,

$$g_\theta \star x = U g_\theta U^\top x. \tag{4.8}$$

$U$ is the matrix of eigenvectors of the normalized graph Laplacian matrix
$L = I_N - D^{-\frac{1}{2}} A D^{-\frac{1}{2}}$, with $L = U \Lambda U^\top$, where
$\Lambda$ is a diagonal matrix of eigenvalues and $U^\top x$ is the graph Fourier
transform of the input signal $x$. In practice, $g_\theta$ can be understood as a
function of the eigenvalues of the normalized graph Laplacian matrix $L$ (i.e.
$g_\theta(\Lambda)$). Directly calculating Eq. 4.8 is computationally expensive:
its cost is quadratic in the number of nodes $N$. According to (Hammond et al,
2011), this problem can be circumvented by approximating the function
$g_\theta(\Lambda)$ with a truncated expansion in terms of Chebyshev polynomials
$T_k(x)$ up to $K$-th order:


$$g_{\theta'}(\Lambda) = \sum_{k=0}^{K} \theta'_k T_k(\tilde{\Lambda}), \tag{4.9}$$

where $\tilde{\Lambda} = \frac{2}{\lambda_{\max}} \Lambda - I_N$, and
$\lambda_{\max}$ is the largest eigenvalue of $L$. $\theta' \in \mathbb{R}^{K+1}$
is the vector of Chebyshev coefficients. $T_k(x)$ are Chebyshev polynomials, which
are recursively defined as $T_k(x) = 2x T_{k-1}(x) - T_{k-2}(x)$, with $T_0(x) = 1$
and $T_1(x) = x$. By combining Eq. 4.9 and Eq. 4.8, the convolution of a signal $x$
with a filter $g_{\theta'}$ can be reformulated as below:

$$g_{\theta'} \star x = \sum_{k=0}^{K} \theta'_k T_k(\tilde{L}) x, \tag{4.10}$$

where $\tilde{L} = \frac{2}{\lambda_{\max}} L - I_N$. From this equation, we can
see that each node only depends on the information within its $K$-th-order
neighborhood. The overall complexity of evaluating Eq. 4.10 is
$\mathcal{O}(|\mathcal{E}|)$ (i.e. linear in the number of edges in the original
graph $\mathcal{G}$), which is very efficient.
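
The efficiency of Eq. 4.10 comes from the Chebyshev recurrence: each term
$T_k(\tilde{L}) x$ is obtained from the two previous terms with a single
matrix-vector product, so no eigendecomposition is needed. The sketch below
illustrates this under assumptions: the names `matvec` and `chebyshev_filter` and
the coefficient values are hypothetical, and the dense product used here for brevity
costs $\mathcal{O}(N^2)$, whereas a sparse product would cost
$\mathcal{O}(|\mathcal{E}|)$ per term.

```python
def matvec(M, x):
    """Dense matrix-vector product."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]


def chebyshev_filter(L_tilde, x, theta):
    """Compute sum_k theta[k] * T_k(L_tilde) x with the three-term recurrence."""
    t_prev = list(x)                      # T_0(L~) x = x
    out = [theta[0] * v for v in t_prev]
    if len(theta) == 1:
        return out
    t_curr = matvec(L_tilde, x)           # T_1(L~) x = L~ x
    out = [o + theta[1] * v for o, v in zip(out, t_curr)]
    for k in range(2, len(theta)):
        Lt = matvec(L_tilde, t_curr)
        # T_k(L~) x = 2 L~ T_{k-1}(L~) x - T_{k-2}(L~) x
        t_next = [2.0 * a - b for a, b in zip(Lt, t_prev)]
        out = [o + theta[k] * v for o, v in zip(out, t_next)]
        t_prev, t_curr = t_curr, t_next
    return out


# Two nodes joined by one edge: L = I - A has eigenvalues 0 and 2, so
# lambda_max = 2 and L~ = L - I = -A.
L_tilde = [[0.0, -1.0], [-1.0, 0.0]]
x = [1.0, 0.0]
theta = [1.0, 1.0, 1.0]  # hypothetical coefficients, K = 2
y = chebyshev_filter(L_tilde, x, theta)  # [2.0, -1.0]
```

For this toy graph, $T_0 x = (1, 0)$, $T_1 x = (0, -1)$, and $T_2 x = (1, 0)$, so
the filtered signal is their sum, $(2, -1)$.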

To define a neural network based on graph convolutions, one can stack multiple
convolution layers defined according to Eq. 4.10, with each layer followed by a
nonlinear transformation. At each layer, instead of using the explicit
parametrization by the Chebyshev polynomials defined in Eq. 4.10, the authors of
GCNs proposed to limit the order of the convolution to $K = 1$. With this choice,
each layer defines only a linear function of the graph Laplacian matrix $L$.
However, by stacking multiple such layers, we can still cover a rich class of
convolutional filter functions on graphs. Intuitively, such a model is capable of
alleviating the problem of overfitting local neighborhood structures for graphs
whose node degree distribution has a high variance, such as social networks, the
World Wide Web, and citation networks.

At each layer, we can further approximate $\lambda_{\max} \approx 2$, which can be
accommodated by the neural network parameters during training. Based on all these
simplifications, we have

$$g_{\theta'} \star x \approx \theta'_0 x + \theta'_1 (L - I_N) x = \theta'_0 x - \theta'_1 D^{-\frac{1}{2}} A D^{-\frac{1}{2}} x, \tag{4.11}$$


where $\theta'_0$ and $\theta'_1$ are two free parameters, which could be shared over the entire
graph. In practice, we can further reduce the number of parameters, which helps to
reduce overfitting while minimizing the number of operations per layer. As
a result, the following expression can be further obtained:

$$g_{\theta} \star x \approx \theta \left(I_N + D^{-1/2} A D^{-1/2}\right) x, \tag{4.12}$$


where $\theta = \theta'_0 = -\theta'_1$. One potential issue is the matrix $I_N + D^{-1/2} A D^{-1/2}$, whose
eigenvalues lie in the interval $[0, 2]$. In a deep graph convolutional neural network,
repeated application of the above function will likely lead to exploding or vanishing
gradients, yielding numerical instabilities. As a result, we can further renormalize
this matrix by converting $I_N + D^{-1/2} A D^{-1/2}$ to $\tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2}$, where $\tilde{A} = A + I_N$ and
$\tilde{D}_{ii} = \sum_j \tilde{A}_{ij}$.

In the above, we only consider the case where there is one feature channel
and one filter. This can be easily generalized to an input signal with $C$ channels
$X \in \mathbb{R}^{N \times C}$ and $F$ filters (or number of hidden units) as follows:

$$H = \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2} X W, \tag{4.13}$$

where $W \in \mathbb{R}^{C \times F}$ is a matrix of filter parameters and $H$ is the convolved signal matrix.
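
To make the renormalization trick and Eqn. (4.13) concrete, the following is a minimal NumPy sketch of a single propagation step (without the nonlinearity). The function names `renormalized_adjacency` and `gcn_layer`, the dense matrices, and the toy graph are our own illustrative assumptions, not part of any particular library.

```python
import numpy as np

def renormalized_adjacency(A):
    """Return D~^{-1/2} (A + I) D~^{-1/2}, the renormalized matrix in Eqn. (4.13)."""
    A_tilde = A + np.eye(A.shape[0])                  # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))   # D~_ii = sum_j A~_ij
    return A_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def gcn_layer(A, X, W):
    """One linear graph convolution H = D~^{-1/2} A~ D~^{-1/2} X W."""
    return renormalized_adjacency(A) @ X @ W

# Toy example (hypothetical): 4 nodes, C = 3 input channels, F = 2 filters.
rng = np.random.default_rng(0)
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
X = rng.normal(size=(4, 3))
W = rng.normal(size=(3, 2))
H = gcn_layer(A, X, W)                                # shape (4, 2)
```

In a full GCN, a nonlinearity such as ReLU would be applied to `H` before feeding it to the next layer.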


4.2.3 Graph Attention Networks 


In GCNs, for a target node $i$, the importance of a neighbor $j$ is determined by the
weight of their edge $A_{ij}$ (normalized by their node degrees). However, in practice,
the input graph may be noisy, and the edge weights may not reflect the true
strength of the connection between two nodes. As a result, a more principled approach would be to
automatically learn the importance of each neighbor. Graph Attention Networks (a.k.a.
GAT) (Veličković et al, 2018) are built on this idea and try to learn the importance of
each neighbor based on the attention mechanism (Bahdanau et al, 2015; Vaswani
et al, 2017). The attention mechanism has been widely used in a variety of tasks in natural
language understanding (e.g., machine translation and question answering) and
computer vision (e.g., visual question answering and image captioning). Next, we
will introduce how attention is used in graph neural networks.


Graph Attention Layer. The graph attention layer defines how to transform the hidden
node representations at layer $k-1$ (denoted as $H^{k-1} \in \mathbb{R}^{N \times F}$) into the new node
representations $H^{k} \in \mathbb{R}^{N \times F'}$. In order to guarantee sufficient expressive power to
transform the lower-level node representations into higher-level node representations,
a shared linear transformation is applied to every node, denoted as $W \in \mathbb{R}^{F' \times F}$.
Afterwards, self-attention is defined on the nodes, which measures the attention coefficients
for any pair of nodes through a shared attentional mechanism $a: \mathbb{R}^{F'} \times \mathbb{R}^{F'} \rightarrow \mathbb{R}$,

$$e_{ij} = a\left(W H_i^{k-1}, W H_j^{k-1}\right). \tag{4.14}$$




$e_{ij}$ indicates the relationship strength between nodes $i$ and $j$. Note that in this subsection
we use $H_i^{k-1}$ to represent a column vector instead of a row vector.
For each node, we can theoretically allow it to attend to every other node in the
graph, which however would ignore the graph structural information. A more reasonable
solution is to attend only to the neighbors of each node. In practice,
only the first-order neighbors are used (including the node itself). To make the
coefficients comparable across different nodes, the attention coefficients are usually
normalized with the softmax function:

$$\alpha_{ij} = \mathrm{Softmax}_j(\{e_{ij}\}) = \frac{\exp(e_{ij})}{\sum_{l \in \mathcal{N}(i)} \exp(e_{il})}. \tag{4.15}$$
We can see that for a node $i$, $\alpha_{ij}$ essentially defines a multinomial distribution over
the neighbors, which can also be interpreted as the transition probability from node
$i$ to each of its neighbors.

In the work by Veličković et al (2018), the attention mechanism $a$ is defined as
a single-layer feedforward neural network including a linear transformation with
the weight vector $W_2 \in \mathbb{R}^{1 \times 2F'}$ and a LeakyReLU nonlinear activation function
(with negative input slope $\alpha = 0.2$). More specifically, we can calculate the attention
coefficients with the following architecture:

$$\alpha_{ij} = \frac{\exp\left(\mathrm{LeakyReLU}\left(W_2 [W H_i^{k-1} \,\|\, W H_j^{k-1}]\right)\right)}{\sum_{l \in \mathcal{N}(i)} \exp\left(\mathrm{LeakyReLU}\left(W_2 [W H_i^{k-1} \,\|\, W H_l^{k-1}]\right)\right)}, \tag{4.16}$$
where || represents the operation of concatenating two vectors. The new node rep- 
resentation is a linear combination of the neighboring node representations with the 
weights determined by the attention coefficients (with a potential nonlinear trans- 
formation), i.e. 


$$H_i^{k} = \sigma\left(\sum_{j \in \mathcal{N}(i)} \alpha_{ij} W H_j^{k-1}\right). \tag{4.17}$$
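
The computation in Eqns. (4.14)-(4.17) can be sketched in a few lines of NumPy. This is a minimal dense illustration under our own assumptions: the function names, the use of `tanh` as the nonlinearity $\sigma$, and the toy graph are hypothetical choices; node representations are stored as rows of `H`, so `H @ W.T` holds $W H_i^{k-1}$ in row $i$.

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def gat_layer(A, H, W, w2, activation=np.tanh):
    """Single-head graph attention layer.

    A: (N, N) adjacency, H: (N, F) inputs, W: (F', F), w2: (2F',) attention vector.
    Returns the new representations (N, F') and the attention matrix (N, N).
    """
    n = A.shape[0]
    Z = H @ W.T                                   # row i holds W H_i
    f_out = Z.shape[1]
    # w2 [z_i || z_j] splits into a term that depends on i and one that depends on j.
    e = leaky_relu((Z @ w2[:f_out])[:, None] + (Z @ w2[f_out:])[None, :])
    mask = (A + np.eye(n)) > 0                    # first-order neighbors plus the node itself
    e = np.where(mask, e, -np.inf)                # non-neighbors get zero attention
    e = e - e.max(axis=1, keepdims=True)          # stabilize the softmax
    alpha = np.exp(e)
    alpha = alpha / alpha.sum(axis=1, keepdims=True)
    return activation(alpha @ Z), alpha

# Toy example (hypothetical): path graph 0 - 1 - 2, F = 4, F' = 2.
rng = np.random.default_rng(0)
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
H = rng.normal(size=(3, 4))
W = rng.normal(size=(2, 4))
w2 = rng.normal(size=4)
H_new, alpha = gat_layer(A, H, W, w2)
```

Each row of `alpha` sums to one and is zero outside the neighborhood, matching the interpretation of $\alpha_{ij}$ as a distribution over the neighbors of node $i$.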


Multi-head Attention. 

In practice, instead of using a single attention mechanism, multi-head attention
can be used, where each head determines a different similarity function over
the nodes. For each attention head, we can independently obtain a new node representation
according to Eqn. (4.17). The final node representation is
a concatenation of the node representations learned by the different attention heads.
Mathematically, we have

$$H_i^{k} = \Big\Vert_{t=1}^{T} \sigma\left(\sum_{j \in \mathcal{N}(i)} \alpha_{ij}^{t} W^{t} H_j^{k-1}\right), \tag{4.18}$$

where $T$ is the total number of attention heads, $\alpha_{ij}^{t}$ is the attention coefficient calculated
from the $t$-th attention head, and $W^{t}$ is the linear transformation matrix of the $t$-th
attention head.

Veličković et al (2018) also note that in the
final layer, when combining the node representations from different attention
heads, other pooling techniques could be used instead of concatenation,
e.g., simply averaging the node representations from the different attention
heads:

$$H_i^{k} = \sigma\left(\frac{1}{T} \sum_{t=1}^{T} \sum_{j \in \mathcal{N}(i)} \alpha_{ij}^{t} W^{t} H_j^{k-1}\right). \tag{4.19}$$
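
The two ways of combining heads, concatenation as in Eqn. (4.18) and averaging as in Eqn. (4.19), differ only in where the nonlinearity is applied and how the per-head outputs are merged. The sketch below illustrates this in NumPy; the helper names `head_aggregate` and `multi_head`, the `tanh` nonlinearity, and the toy inputs are our own assumptions for illustration.

```python
import numpy as np

def head_aggregate(A, H, W, w2, slope=0.2):
    """Return sum_j alpha_ij W H_j for one attention head (before the nonlinearity)."""
    Z = H @ W.T
    f_out = Z.shape[1]
    e = (Z @ w2[:f_out])[:, None] + (Z @ w2[f_out:])[None, :]
    e = np.where(e > 0, e, slope * e)                        # LeakyReLU
    e = np.where((A + np.eye(len(A))) > 0, e, -np.inf)       # restrict to neighbors and self
    alpha = np.exp(e - e.max(axis=1, keepdims=True))
    alpha /= alpha.sum(axis=1, keepdims=True)
    return alpha @ Z

def multi_head(A, H, Ws, w2s, mode="concat", activation=np.tanh):
    outs = [head_aggregate(A, H, W, w2) for W, w2 in zip(Ws, w2s)]
    if mode == "concat":
        # Eqn. (4.18): apply the nonlinearity per head, then concatenate.
        return np.concatenate([activation(o) for o in outs], axis=1)
    # Eqn. (4.19): average the heads first, then apply the nonlinearity.
    return activation(np.mean(outs, axis=0))

# Toy example (hypothetical): T = 3 heads, F = 4, F' = 2.
rng = np.random.default_rng(0)
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
H = rng.normal(size=(3, 4))
Ws = [rng.normal(size=(2, 4)) for _ in range(3)]
w2s = [rng.normal(size=4) for _ in range(3)]
H_cat = multi_head(A, H, Ws, w2s, mode="concat")   # shape (3, 6)
H_avg = multi_head(A, H, Ws, w2s, mode="mean")     # shape (3, 2)
```

Concatenation grows the output width to $T F'$, whereas averaging keeps it at $F'$, which is one reason averaging is convenient in the final layer.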


4.2.4 Neural Message Passing Networks 


Another very popular graph neural network architecture is the Neural Message Passing
Network (MPNN) (Gilmer et al, 2017), which was originally proposed for learning
molecular graph representations. However, MPNN is actually very general: it provides
a general framework of graph neural networks and could be used for the task
of node classification as well. The essential idea of MPNN is to formulate existing
graph neural networks within a general framework of neural message passing among
nodes. In MPNNs, there are two important functions, the Message function and the Updating
function:

$$m_i^{k} = \sum_{j \in \mathcal{N}(i)} M_k\left(H_i^{k-1}, H_j^{k-1}, e_{ij}\right), \tag{4.20}$$
$$H_i^{k} = U_k\left(H_i^{k-1}, m_i^{k}\right). \tag{4.21}$$


$M_k(\cdot,\cdot,\cdot)$ defines the message between nodes $i$ and $j$ in the $k$-th layer, which depends
on the two node representations and the information of their edge. $U_k$ is the node
updating function in the $k$-th layer, which combines the aggregated messages from
the neighbors and the node representation itself. We can see that the MPNN framework
is very similar to the general framework we introduced in Section 4.2.1. The
AGGREGATE function defined here is simply a summation of all the messages
from the neighbors. The COMBINE function is the same as the node Updating
function.
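
The message and update steps can be sketched generically, with the Message and Updating functions passed in as callables. The code below is a minimal illustration under our own assumptions: the function name `mpnn_layer`, the edge-list representation, and the particular message and update functions in the example are hypothetical choices, not those of Gilmer et al (2017).

```python
import numpy as np

def mpnn_layer(H, edges, edge_feat, message_fn, update_fn, msg_dim):
    """One message-passing step.

    H: (N, d) node representations.
    edges: list of directed pairs (i, j), meaning j is a neighbor of i.
    edge_feat: list of edge features e_ij aligned with `edges`.
    """
    m = np.zeros((H.shape[0], msg_dim))
    for (i, j), e_ij in zip(edges, edge_feat):
        m[i] += message_fn(H[i], H[j], e_ij)          # sum of messages, Eqn. (4.20)
    # Node update from the old representation and the aggregated message, Eqn. (4.21)
    return np.stack([update_fn(H[i], m[i]) for i in range(H.shape[0])])

# Toy example (hypothetical): path graph 0 - 1 - 2 with unit edge weights.
H = np.array([[1.0], [2.0], [4.0]])
edges = [(0, 1), (1, 0), (1, 2), (2, 1)]
edge_feat = [1.0, 1.0, 1.0, 1.0]
message_fn = lambda h_i, h_j, e_ij: e_ij * h_j        # message: weighted neighbor state
update_fn = lambda h_i, m_i: h_i + m_i                # update: residual sum
H_new = mpnn_layer(H, edges, edge_feat, message_fn, update_fn, msg_dim=1)
```

Choosing a degree-normalized message and a linear update followed by a nonlinearity would recover a GCN-style layer within the same interface.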


4.2.5 Continuous Graph Neural Networks 


The above graph neural networks iteratively update the node representations with
different kinds of graph convolutional layers. Essentially, these approaches model
the discrete dynamics of node representations with GNNs. Xhonneux et al (2020)
proposed continuous graph neural networks (CGNNs), which generalize existing
graph neural networks with discrete dynamics to continuous settings, i.e., they try
to model the continuous dynamics of node representations. The key question is how to
characterize the continuous dynamics of node representations, i.e., the derivatives of
the node representations w.r.t. time. The CGNN model is inspired by diffusion-based
models on graphs such as PageRank and epidemic models on social networks. The
derivatives of the node representations are defined as a combination of the node
representation itself, the representations of its neighbors, and the initial status of the
nodes. Specifically, two different variants of node dynamics are introduced. The first
model assumes that different dimensions of the node representations (a.k.a. feature channels)
are independent; the second model is more flexible and allows different
feature channels to interact with each other. Next, we give a detailed introduction to
each of the two models.

Note: in this part, instead of using the original adjacency matrix $A$, we use the following
regularized matrix for characterizing the graph structure:

$$A := \frac{\alpha}{2}\left(I + D^{-1/2} A D^{-1/2}\right), \tag{4.22}$$

where $\alpha \in (0, 1)$ is a hyperparameter and $D$ is the degree matrix of the original adjacency
matrix $A$ (the $A$ on the right-hand side). With the new regularized adjacency matrix $A$, the eigenvalues of $A$
lie in the interval $[0, \alpha]$, which makes $A^{k}$ converge to $0$ as the power $k$ increases.


Model 1: Independent Feature Channels. As different nodes in a graph are interconnected,
a natural way to model the dynamics of each feature channel is to
take the graph structure into consideration, which allows the information to
propagate across different nodes. We are motivated by existing diffusion-based
methods on graphs such as PageRank (Page et al, 1999) and label propagation (Zhou
et al, 2004), which define the discrete propagation of node representations (or signals
on nodes) with the following step-wise propagation equation:

$$H^{k+1} = A H^{k} + H^{0}, \tag{4.23}$$


where $H^{0} = X$ or the output of an encoder on the input features $X$. Intuitively, at each
step, the new node representation is a linear combination of its neighboring node
representations as well as the initial node features. Such a mechanism allows us to
model the information propagation on the graph without forgetting the initial node
features. We can unroll Eqn. (4.23) and explicitly derive the node representations
at the $k$-th step:

$$H^{k} = \left(\sum_{i=0}^{k} A^{i}\right) H^{0} = (A - I)^{-1}\left(A^{k+1} - I\right) H^{0}. \tag{4.24}$$
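
As a quick numerical sanity check, one can iterate the step-wise propagation of Eqn. (4.23) and compare the result against the closed form of Eqn. (4.24). The NumPy sketch below does this on a small cycle graph; the helper name `regularized_adjacency`, the choice $\alpha = 0.9$, and the toy graph (which has no isolated nodes) are our own assumptions.

```python
import numpy as np

def regularized_adjacency(adj, alpha):
    """Return (alpha / 2) * (I + D^{-1/2} adj D^{-1/2}), as in Eqn. (4.22)."""
    d_inv_sqrt = 1.0 / np.sqrt(adj.sum(axis=1))       # assumes no isolated nodes
    norm_adj = adj * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return 0.5 * alpha * (np.eye(len(adj)) + norm_adj)

# Toy example (hypothetical): 4-cycle with 3 feature channels.
adj = np.array([[0, 1, 0, 1],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [1, 0, 1, 0]], dtype=float)
A = regularized_adjacency(adj, alpha=0.9)
I = np.eye(4)
rng = np.random.default_rng(0)
H0 = rng.normal(size=(4, 3))

# Iterate H^{k+1} = A H^k + H^0 for k steps.
k = 10
H_iter = H0.copy()
for _ in range(k):
    H_iter = A @ H_iter + H0

# Closed form (A - I)^{-1} (A^{k+1} - I) H^0.
H_closed = np.linalg.solve(A - I, (np.linalg.matrix_power(A, k + 1) - I) @ H0)
```

The two results agree up to floating-point error, and the eigenvalues of `A` lie in $[0, \alpha]$, so $A - I$ is invertible.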


As the above equation effectively models the discrete dynamics of node representations,
the CGNN model further extends it to the continuous setting by
replacing the discrete time step $k$ with a continuous variable $t \in \mathbb{R}_0^{+}$. Specifically, it
has been shown that Eqn. (4.24) is a discretization of the following ordinary
differential equation (ODE):

$$\frac{dH^{t}}{dt} = \log(A)\, H^{t} + X, \tag{4.25}$$


with the initial value $H^{0} = (\log A)^{-1}(A - I)X$, where $X$ is the initial node features or
the output of an encoder applied to them. We do not provide the proof here; more details
can be found in the original paper (Xhonneux et al, 2020). In Eqn. (4.25),
as $\log A$ is intractable to compute in practice, it is approximated with the first-order
term of its Taylor expansion, i.e., $\log A \approx A - I$. Putting all of this together, we
have the following ODE:

$$\frac{dH^{t}}{dt} = (A - I) H^{t} + X, \tag{4.26}$$

with the initial value $H^{0} = X$, which is the first variant of the CGNN model.
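
To get a feel for the dynamics in Eqn. (4.26), one can integrate the ODE numerically. The sketch below uses a plain forward-Euler scheme as a simple stand-in for a proper ODE solver; the function name `cgnn_euler`, the step size, the integration time, and the toy graph are our own assumptions. Since the eigenvalues of $A - I$ are negative, the trajectory should settle at the stationary point where $(A - I)H + X = 0$, i.e., $H = (I - A)^{-1} X$.

```python
import numpy as np

def cgnn_euler(A, X, t_end, dt=0.1):
    """Forward-Euler integration of dH/dt = (A - I) H + X with H(0) = X."""
    I = np.eye(len(A))
    H = X.copy()
    for _ in range(int(round(t_end / dt))):
        H = H + dt * ((A - I) @ H + X)
    return H

# Toy example (hypothetical): regularized adjacency of a 4-cycle with alpha = 0.9.
adj = np.array([[0, 1, 0, 1],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [1, 0, 1, 0]], dtype=float)
d_inv_sqrt = 1.0 / np.sqrt(adj.sum(axis=1))
A = 0.5 * 0.9 * (np.eye(4) + adj * d_inv_sqrt[:, None] * d_inv_sqrt[None, :])

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))
H_T = cgnn_euler(A, X, t_end=200.0)
H_star = np.linalg.solve(np.eye(4) - A, X)   # stationary point of the ODE
```

For long integration times `H_T` approaches `H_star`, which illustrates the stability with respect to the integration time; an adaptive solver would typically be preferred over fixed-step Euler in practice.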

The CGNN model is actually very intuitive and has a nice connection with
the traditional epidemic model, which aims at studying the dynamics of infection in a
population. The epidemic model usually assumes that the infection of people
is affected by three different factors: the infection from neighbors, the
natural recovery, and the natural characteristics of people. If we treat $H^{t}$ as the number
of people infected at time $t$, then these three factors can be naturally modeled by
the three terms in Eqn. (4.26): $A H^{t}$ for the infection from neighbors, $-H^{t}$
for the natural recovery, and the last one, $X$, for the natural characteristics of people.


Model 2: Modeling the Interaction of Feature Channels. The above model assumes
that different node feature channels are independent of each other, which is a
very strong assumption and limits the capacity of the model. Inspired by the success
of a linear variant of graph neural networks (i.e., Simple GCN (Wu et al, 2019a)),
a more powerful discrete node dynamic model is proposed, which allows different
feature channels to interact with each other as

$$H^{k+1} = A H^{k} W + H^{0}, \tag{4.27}$$


where $W \in \mathbb{R}^{d \times d}$ (with $d$ the number of feature channels) is a weight matrix used to model the interactions between different
feature channels. Similarly, we can also extend the above discrete dynamics to the
continuous case, yielding the following equation:

$$\frac{dH^{t}}{dt} = (A - I) H^{t} + H^{t} (W - I) + X, \tag{4.28}$$


with the initial value $H^{0} = X$. This is the second variant of CGNN, with trainable
weights. ODEs of a similar form to Eqn. (4.28) have been studied
in the control theory literature, where they are known as Sylvester differential equations
(Locatelli and Sieniutycz, 2002). The two matrices $A - I$ and $W - I$ characterize
the natural solution of the system, while $X$ is the information provided to the system
to drive it into the desired state.


Discussion. The proposed continuous graph neural networks (CGNNs) have multiple
nice properties: (1) Recent work has shown that if we increase the number of layers
$K$ in discrete graph neural networks, the learned node representations tend to
suffer from over-smoothing (introduced in detail later) and hence lose
expressive power. In contrast, continuous graph neural networks
can be trained as very deep graph neural networks and are experimentally robust to
arbitrarily chosen integration times. (2) For some tasks on graphs, it is critical
to model the long-range dependency between nodes, which requires training
deep GNNs. Existing discrete GNNs fail to train very deep GNNs due to the over-smoothing
problem. CGNNs are able to effectively model the long-range dependency
between nodes thanks to their stability w.r.t. time. (3) The hyperparameter
$\alpha$ is very important, as it controls the rate of diffusion. Specifically, it controls the
rate at which high-order powers of the regularized matrix $A$ vanish. Xhonneux et al (2020)
proposed to learn a different value of
$\alpha$ for each node, which allows the model to choose the best diffusion rate for each
node.


4.2.6 Multi-Scale Spectral Graph Convolutional Networks 


Recall the one-layer graph convolution operator used in GCNs (Kipf and Welling,
2017b), $H = LHW$, where $L = D^{-1/2} A D^{-1/2}$. Here we drop the superscript of the layer
index to avoid a clash with the notation for the matrix power. There are two main
issues with this simple graph convolution formulation. First, one such graph convolutional
layer only propagates information from any node to its nearest neighbors,
i.e., neighboring nodes that are one hop away. If one would like to propagate
information to neighbors $M$ hops away, one has to either stack $M$ graph convolutional
layers or compute the graph convolution with the $M$-th power of the graph Laplacian,
i.e., $H = \sigma(L^{M} H W)$. When $M$ is large, stacking layers makes
the whole GCN model very deep, thus causing problems in learning such as vanishing
gradients. This is similar to what people experienced when training very deep
feedforward neural networks. For the matrix power solution, naively computing the
$M$-th power of the graph Laplacian is also very costly (e.g., the time complexity is
$\mathcal{O}(N^{3}(M-1))$ for graphs with $N$ nodes). Second, there are no learnable parameters
in GCNs associated with the graph Laplacian $L$ (corresponding to the connectivities/structures).
The only learnable parameter $W$ is a linear transform applied to
every node simultaneously, which is not aware of the structures. Note that we typically
associate learnable weights with edges when applying convolution
to regular graphs like grids (e.g., applying 2D convolution to images). This
greatly improves the expressiveness of the model. However, it is not clear how
one can add learnable parameters to the graph Laplacian $L$ since its size varies from
graph to graph.


Algorithm 1: Lanczos Algorithm

Input: $S$, $x$, $M$, $\epsilon$
Initialization: $\beta_0 = 0$, $q_0 = 0$, and $q_1 = x / \|x\|$
For $j = 1, 2, \ldots, M$:
    $z = S q_j$
    $\gamma_j = q_j^{\top} z$
    $z = z - \gamma_j q_j - \beta_{j-1} q_{j-1}$
    $\beta_j = \|z\|_2$
    If $\beta_j < \epsilon$, quit
    $q_{j+1} = z / \beta_j$
$Q = [q_1, q_2, \cdots, q_M]$
Construct $T$ following Eq. (4.29)
Eigendecomposition $T = B R B^{\top}$
Return $V = QB$ and $R$


[Figure 4.1: diagram of a two-layer Lanczos Network (Layer 1, Layer 2); each layer has long-range spectral filtering blocks on top and short-range graph convolution blocks at the bottom.]

Fig. 4.1: The inference procedure of Lanczos Networks. The approximated top
eigenvalues $\{r_k\}$ and eigenvectors $\{v_k\}$ are computed by the Lanczos algorithm.
Note that this step is only needed once per graph. The long range/scale (top blocks)
graph convolutions are efficiently computed by the low-rank approximation of the
graph Laplacian. One can control the ranges (i.e., the exponents of the eigenvalues)
as hyperparameters. Learnable spectral filters are applied to the approximated top
eigenvalues $\{r_k\}$. The short range/scale (bottom blocks) graph convolution is the
same as in GCNs. Adapted from Figure 1 of (Liao et al, 2019b).


To overcome these two problems, Liao et al (2019b) propose Lanczos Networks.
Given the graph Laplacian matrix $L$ (assumed symmetric here; if it is non-symmetric, e.g., for directed
graphs, one can resort to the Arnoldi algorithm) and node features $X$, one first
uses the $M$-step Lanczos algorithm (Lanczos, 1950) (listed in Alg. 1) to compute an
orthogonal matrix $Q$ and a symmetric tridiagonal matrix $T$, such that $Q^{\top} L Q = T$.
We denote $Q = [q_1, \cdots, q_M]$, where the column vector $q_i$ is the $i$-th Lanczos vector. Note
that $M$ could be much smaller than the number of nodes $N$. $T$ is illustrated below:

$$T = \begin{bmatrix} \gamma_1 & \beta_1 & & \\ \beta_1 & \ddots & \ddots & \\ & \ddots & \ddots & \beta_{M-1} \\ & & \beta_{M-1} & \gamma_M \end{bmatrix}. \tag{4.29}$$


After obtaining the tridiagonal matrix $T$, we can compute the Ritz values and Ritz
vectors, which approximate the top eigenvalues and eigenvectors of $L$, by diagonalizing
the matrix $T$ as $T = B R B^{\top}$, where the $M \times M$ diagonal matrix $R$ contains the
Ritz values and $B \in \mathbb{R}^{M \times M}$ is an orthogonal matrix. Here, top means ranking the
eigenvalues by their magnitudes in descending order. This can be implemented
via a general eigendecomposition or fast decomposition methods specialized
for tridiagonal matrices. Now we have a low-rank approximation of the graph
Laplacian matrix $L \approx V R V^{\top}$, where $V = QB$. Denoting the column vectors of $V$ as
$\{v_1, \cdots, v_M\}$ and the diagonal entries of $R$ as $\{r_1, \cdots, r_M\}$, we can compute the multi-scale graph convolution as

$$H = \hat{L} H W, \qquad \hat{L} = \sum_{m=1}^{M} f_{\theta}\left(r_m^{I_1}, r_m^{I_2}, \cdots, r_m^{I_u}\right) v_m v_m^{\top}, \tag{4.30}$$
where $\{I_1, \cdots, I_u\}$ is the set of scale/range parameters, which determine how many
hops (or how far) one would like to propagate the information over the graph. For
example, one could set $\{I_1 = 50, I_2 = 100\}$ ($u = 2$ in this case) to consider
propagating 50 and 100 steps, respectively. Note that one only needs to
compute scalar powers rather than the original matrix power. The overall complexity
of the Lanczos algorithm in our context is $\mathcal{O}(MN^{2})$, which makes the whole
algorithm much more efficient than naively computing the matrix power. Moreover,
$f_{\theta}$ is a learnable spectral filter parameterized by $\theta$ and can be applied to graphs of
varying sizes since the graph size is decoupled from the input size of $f_{\theta}$. $f_{\theta}$ directly
acts on the graph Laplacian and greatly improves the expressiveness of the model.

Although the Lanczos algorithm provides an efficient way to approximately compute
arbitrary powers of the graph Laplacian, it is still a low-rank approximation,
which may lose certain information (e.g., the high-frequency components). To alleviate this
problem, one can additionally perform vanilla graph convolution with small scale parameters,
e.g., $H = L^{S} H W$, where $S$ is a small integer like 2 or 3. The resulting representation
can be concatenated with the one obtained from the longer scale/range
graph convolution in Eq. (4.30). Relying on the above design, one could add nonlinearities
and stack multiple such layers to build a deep graph convolutional network
(namely Lanczos Networks), just like GCNs. The overall inference procedure of
Lanczos Networks is shown in Fig. 4.1. This method demonstrates strong empirical
performance on a wide variety of tasks/benchmarks, including molecular property
prediction in quantum chemistry and document classification in citation networks. It
requires only slight modifications to the implementation of the original GCNs. Nevertheless,
if the input graph is extremely large (e.g., a large social network), the
Lanczos algorithm itself would become a computational bottleneck. How to improve this
model in such a problem context remains an open question.
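
The Lanczos iteration of Alg. 1 and the resulting approximation of matrix powers through scalar powers of the Ritz values can be sketched as follows. This is a minimal NumPy illustration under our own assumptions: the function name `lanczos` is hypothetical, no reorthogonalization is performed (so it is only suitable for small $M$), a random symmetric matrix stands in for the graph Laplacian, and the learnable filter $f_{\theta}$ is replaced by a fixed power.

```python
import numpy as np

def lanczos(S, x, M, eps=1e-10):
    """M-step Lanczos iteration; returns Ritz vectors V (N x k) and Ritz values r (k,)."""
    n = S.shape[0]
    q_prev, beta_prev = np.zeros(n), 0.0
    q = x / np.linalg.norm(x)
    Q, gammas, betas = [], [], []
    for _ in range(M):
        Q.append(q)
        z = S @ q
        gamma = q @ z
        z = z - gamma * q - beta_prev * q_prev
        beta = np.linalg.norm(z)
        gammas.append(gamma)
        if beta < eps:                     # Krylov subspace exhausted, stop early
            break
        betas.append(beta)
        q_prev, q, beta_prev = q, z / beta, beta
    k = len(gammas)
    off = betas[:k - 1]                    # only beta_1 .. beta_{k-1} enter T
    T = np.diag(gammas) + np.diag(off, 1) + np.diag(off, -1)
    r, B = np.linalg.eigh(T)               # T = B diag(r) B^T
    return np.stack(Q, axis=1) @ B, r

# Toy example (hypothetical): random symmetric 6 x 6 matrix in place of L.
rng = np.random.default_rng(0)
B0 = rng.normal(size=(6, 6))
S = (B0 + B0.T) / 2.0
x = rng.normal(size=6)
V, r = lanczos(S, x, M=6)

# S^3 approximated via scalar powers of the Ritz values: V diag(r^3) V^T.
S_cubed = (V * r**3) @ V.T
```

With $M = N$ the reconstruction is exact up to floating-point error; with $M \ll N$ the same expression gives the low-rank approximation discussed above, at the cost of losing part of the spectrum.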

Here we only introduce a few representative architectures of graph neural networks
for node classification. There are also many other well-known architectures,
including gated graph neural networks (Li et al, 2016b), which are mainly designed
for sequence outputs, and GraphSAGE (Hamilton et al, 2017b), which is mainly
designed for the inductive setting of node classification.


4.3 Unsupervised Graph Neural Networks 


In this section, we review a few representative GNN-based methods for unsuper- 
vised learning on graph-structured data, including variational graph auto-encoders 
(Kipf and Welling, 2016) and deep graph infomax (Veličković et al, 2019). 


4.3.1 Variational Graph Auto-Encoders 


Following variational auto-encoders (VAEs) (Kingma and Welling, 2014; Rezende
et al, 2014), variational graph auto-encoders (VGAEs) (Kipf and Welling, 2016)
provide a framework for unsupervised learning on graph-structured data. In the following,
we first review the model and then discuss its advantages and disadvantages.


4.3.1.1 Problem Setup 


Suppose we are given an undirected graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ with $N$ nodes. Each node
is associated with a node feature/attribute vector. We compactly denote all node
features as a matrix $X \in \mathbb{R}^{N \times C}$. The adjacency matrix of the graph is $A$. We assume
self-loops are added to the original graph $\mathcal{G}$ so that the diagonal entries of $A$ are 1.
This is a convention in graph convolutional networks (GCNs) (Kipf and Welling,
2017b) and makes the model consider a node's old representation while updating its
new representation. We also assume each node is associated with a latent variable
(the collection of all latent variables is again compactly denoted as a matrix $Z \in \mathbb{R}^{N \times F}$).
We are interested in inferring the latent variables of nodes in the graph and
decoding the edges.


4.3.1.2 Model


Similar to VAEs, the VGAE model consists of an encoder $q_{\phi}(Z|A,X)$, a decoder
$p_{\theta}(A|Z)$, and a prior $p(Z)$.

Encoder The goal of the encoder is to learn a distribution of latent variables associated
with each node, conditioned on the node features $X$ and the adjacency matrix
$A$. We could instantiate $q_{\phi}(Z|A,X)$ as a graph neural network where the learnable
parameters are $\phi$. In particular, VGAE assumes a node-independent encoder as
below,

$$q_{\phi}(Z|X,A) = \prod_{i=1}^{N} q_{\phi}(z_i|X,A), \tag{4.31}$$
$$q_{\phi}(z_i|X,A) = \mathcal{N}\left(z_i \mid \mu_i, \mathrm{diag}(\sigma_i^{2})\right), \tag{4.32}$$
$$\mu, \sigma = \mathrm{GCN}_{\phi}(X,A), \tag{4.33}$$


where $z_i$, $\mu_i$, and $\sigma_i$ are the $i$-th rows of the matrices $Z$, $\mu$, and $\sigma$ respectively. Basically,
we assume a multivariate Normal distribution with diagonal covariance
as the variational approximate distribution of the latent vector per node (i.e., $z_i$).
The mean and diagonal covariance are predicted by the encoder network, i.e., a GCN
as described in Section 4.2.2. For example, the original paper uses a two-layer GCN
as follows,

$$\mu = \tilde{A} H W_{\mu}, \tag{4.34}$$
$$\sigma = \tilde{A} H W_{\sigma}, \tag{4.35}$$
$$H = \mathrm{ReLU}(\tilde{A} X W_0), \tag{4.36}$$

where $\tilde{A} = D^{-1/2} A D^{-1/2}$ is the symmetrically normalized adjacency matrix and $D$ is
the degree matrix. The learnable parameters are thus $\phi = [W_{\mu}, W_{\sigma}, W_0]$.

Decoder Given sampled latent variables, the decoder aims at predicting the connectivities
among nodes. The original paper adopts a simple dot-product based predictor
as below,

$$p(A|Z) = \prod_{i=1}^{N} \prod_{j=1}^{N} p(A_{ij}|z_i, z_j), \tag{4.37}$$
$$p(A_{ij}|z_i, z_j) = \sigma(z_i^{\top} z_j), \tag{4.38}$$


where $A_{ij}$ denotes the $(i, j)$-th element and $\sigma(\cdot)$ is the logistic sigmoid function.
This decoder again assumes conditional independence among all possible edges for
tractability. Note that there are no learnable parameters associated with this decoder.
The only way to improve the performance of the decoder is to learn good latent
representations.

Prior The prior distributions over the latent variables are simply set to independent
zero-mean Gaussians with unit variances,

$$p(Z) = \prod_{i=1}^{N} \mathcal{N}(z_i \mid 0, I). \tag{4.39}$$

This prior is fixed throughout learning, as in typical VAEs.

Objective & Learning To learn the encoder and the decoder, one typically maximizes
the evidence lower bound (ELBO) as in VAEs,


$$\mathcal{L}_{\mathrm{ELBO}} = \mathbb{E}_{q_{\phi}(Z|X,A)}\left[\log p(A|Z)\right] - \mathrm{KL}\left(q_{\phi}(Z|X,A) \,\|\, p(Z)\right), \tag{4.40}$$


where $\mathrm{KL}(q\|p)$ is the Kullback-Leibler divergence between distributions $q$ and $p$.
Note that we cannot directly maximize the log likelihood since the introduction
of the latent variables $Z$ induces a high-dimensional integral, which is intractable. We
instead maximize the ELBO in Eq. (4.40), which is a lower bound of the log likelihood.
However, the first expectation term is again intractable. One often resorts
to Monte Carlo estimation by sampling a few $Z$ from the encoder $q_{\phi}(Z|X,A)$
and evaluating the term using the samples. To maximize the objective, one can perform
stochastic gradient ascent (equivalently, stochastic gradient descent on the negative ELBO)
along with the reparameterization trick (Kingma
and Welling, 2014). Note that the reparameterization trick is necessary since we
need to back-propagate through the sampling in the aforementioned Monte Carlo
estimation term to compute the gradient w.r.t. the parameters of the encoder.
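
Putting the encoder, decoder, prior, and objective together, a single-sample Monte Carlo estimate of the ELBO in Eq. (4.40) can be sketched in NumPy as below. This is a forward computation only (no gradients), and several choices are our own assumptions: the function name `vgae_elbo`, the toy graph and weight scales, and in particular the reading of the second encoder head as $\log \sigma$ so that the standard deviation stays positive, which is how the original VGAE paper parameterizes it.

```python
import numpy as np

def vgae_elbo(A, X, W0, W_mu, W_sigma, rng):
    """One-sample ELBO estimate; A is a binary adjacency matrix with self-loops."""
    d = A.sum(axis=1)
    A_norm = A / np.sqrt(np.outer(d, d))          # D^{-1/2} A D^{-1/2}
    H = np.maximum(A_norm @ X @ W0, 0.0)          # Eq. (4.36)
    mu = A_norm @ H @ W_mu                        # Eq. (4.34)
    log_sigma = A_norm @ H @ W_sigma              # assumption: head outputs log(sigma)
    # Reparameterization trick: Z = mu + sigma * eps with eps ~ N(0, I).
    eps = rng.standard_normal(mu.shape)
    Z = mu + np.exp(log_sigma) * eps
    # Dot-product decoder: log p(A|Z) summed over all pairs (i, j).
    logits = Z @ Z.T
    log_lik = np.sum(A * logits - np.logaddexp(0.0, logits))
    # KL(N(mu, diag(sigma^2)) || N(0, I)), summed over nodes and dimensions.
    kl = 0.5 * np.sum(np.exp(2.0 * log_sigma) + mu**2 - 1.0 - 2.0 * log_sigma)
    return log_lik - kl, kl

# Toy example (hypothetical): 4 nodes, C = 3 features, 5 hidden units, 2 latent dimensions.
rng = np.random.default_rng(0)
A = np.array([[1, 1, 0, 0],
              [1, 1, 1, 0],
              [0, 1, 1, 1],
              [0, 0, 1, 1]], dtype=float)
X = rng.normal(size=(4, 3))
W0 = 0.1 * rng.normal(size=(3, 5))
W_mu = 0.1 * rng.normal(size=(5, 2))
W_sigma = 0.1 * rng.normal(size=(5, 2))
elbo, kl = vgae_elbo(A, X, W0, W_mu, W_sigma, rng)
```

In an actual implementation, the same computation would be written in an automatic differentiation framework so that gradients can flow through `Z` to the encoder weights.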


4.3.1.3 Discussion 


The VGAE model is popular in the literature mainly due to its simplicity and good
empirical performance. For example, since there are no learnable parameters for
the prior and the decoder, the model is quite lightweight and the learning process
is fast. Moreover, the VGAE model is versatile in that once we have learned a good
encoder, i.e., good latent representations, we can use them for predicting edges (i.e.,
link prediction), node attributes, and so on. On the other hand, the VGAE model is still
limited in the following ways. First, it cannot serve as a good generative model for
graphs, as VAEs do for images, since the decoder is not learnable. One could
simply design a learnable decoder. However, it is not clear whether the goals of learning
good latent representations and generating graphs of good quality are always
well aligned. More exploration along this direction would be fruitful. Second, the
independence assumption is exploited for both the encoder and the decoder, which
might be very limiting. More structural dependence (e.g., auto-regressive) would be
desirable to improve the model capacity. Third, as discussed in the original paper,
the prior may potentially be a poor choice. Finally, for link prediction in practice,
one may need to reweight edges vs. non-edges in the decoder term and
carefully tune the weighting, since graphs may be very sparse.


4.3.2 Deep Graph Infomax


Following Mutual Information Neural Estimation (MINE) (Belghazi et al, 2018) and 
Deep Infomax (Hjelm et al, 2018), Deep Graph Infomax (Veličković et al, 2019) is 
an unsupervised learning framework that learns graph representations via the prin- 
ciple of mutual information maximization. 


4.3.2.1 Problem Setup 


Following the original paper, we will explain the model under the single-graph 
setup, i.e., the node feature matrix X and the graph adjacency matrix A of a single 
graph are provided as input. Extensions to other problem setups like transductive 
and inductive learning settings will be discussed in Section 4.3.2.3. The goal is to 
learn the node representations in an unsupervised way. After node representations 
are learned, one can apply some simple linear (logistic regression) classifier on top 
of the representations to perform supervised tasks like node classification. 


4.3.2.2 Model 


Fig. 4.2: The overall process of Deep Graph Infomax. The top path shows how the
positive sample is processed, whereas the bottom path shows the process corresponding to
the negative sample. Note that the graph representation is shared for both positive
and negative samples. Subgraphs of positive and negative samples do not necessarily
need to be different. Adapted from Figure 1 of (Veličković et al, 2019).


The main idea of the model is to maximize the local mutual information between
a node representation (capturing local graph information) and the graph representation
(capturing global graph information). By doing so, the learned node representation
should capture the global graph information as much as possible. Let us
denote the graph encoder as $\mathcal{E}$, which could be any GNN discussed before, e.g., a
two-layer GCN. We can obtain all node representations as $H = \mathcal{E}(X, A)$, where the
representation $h_i$ of any node $i$ should contain some local information near node $i$.
Specifically, a $k$-layer GCN should be able to leverage node information that is up to $k$ hops
away. To get the global graph information, one could use a readout layer/function
to process all node representations, i.e., $s = \mathcal{R}(H)$, where the readout function $\mathcal{R}$
could be some learnable pooling function or simply an average operator.

Objective Given the local node representation h; and the global graph represen- 
tation s, the natural next-step is to compute their mutual information. Recall the 
definition of mutual information is as follows, 


$$I(h, s) = \int\!\!\int p(h, s) \log\left(\frac{p(h, s)}{p(h)\, p(s)}\right) dh\, ds. \tag{4.41}$$


However, maximizing the local mutual information alone is not enough to learn 
useful representations as shown in (Hjelm et al, 2018). To develop a more practical 
objective, authors in (Veličković et al, 2019) instead use a noise-contrastive type 
objective following Deep Infomax (Hjelm et al, 2018), 


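$$\mathcal{L} = \frac{1}{N+M}\left(\sum_{i=1}^{N} \mathbb{E}_{(X,A)}\left[\log \mathcal{D}(h_i, s)\right] + \sum_{j=1}^{M} \mathbb{E}_{(\tilde{X},\tilde{A})}\left[\log\left(1 - \mathcal{D}(\tilde{h}_j, s)\right)\right]\right), \tag{4.42}$$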


where $\mathcal{D}$ is a binary classifier which takes both the node representation $h_i$ and the
graph representation $s$ as input and predicts whether the pair $(h_i, s)$ comes from the
joint distribution $p(h, s)$ (positive class) or the product of marginals $p(h_i)\,p(s)$ (negative
class). We denote $\tilde{h}_j$ as the $j$-th node representation from the negative sample.
The numbers of positive and negative samples are $N$ and $M$ respectively. We will
explain how to draw positive and negative samples shortly. The overall objective is
thus the negative binary cross-entropy for training a probabilistic classifier. Note that
this objective is the same type of distance as used in generative adversarial networks
(GANs) (Goodfellow et al, 2014b), which is shown to be proportional to the Jensen-Shannon
divergence (Goodfellow et al, 2014b; Nowozin et al, 2016). As verified by
(Hjelm et al, 2018), maximizing the Jensen-Shannon divergence based mutual information
estimator behaves similarly (i.e., they have an approximately monotonic
relationship) to directly maximizing the mutual information. Therefore, maximizing
the objective in Eq. (4.42) is expected to maximize the mutual information. Moreover,
the freedom of choosing negative samples makes the method more likely to
learn useful representations than maximizing the vanilla mutual information.

Negative Sampling To generate the positive samples, one can directly sample a
few nodes from the graph to construct the pairs $(h_i, s)$. For negative samples, one can
generate them by corrupting the original graph data, denoted as
$(\tilde{X}, \tilde{A}) \sim \mathcal{C}(X, A)$. In practice, one can choose various forms
of this corruption function $\mathcal{C}$. For example, the authors of (Veličković et al,
2019) suggest keeping the adjacency matrix unchanged and corrupting the node features $X$
by row-wise shuffling. Other possibilities for the corruption function include randomly
sampling subgraphs and applying Dropout (Srivastava et al, 2014) to node features.
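The row-wise shuffling corruption described above can be sketched in a few lines. This
is an illustrative sketch only: the function name `corrupt_by_row_shuffle` and the
`seed` parameter are our own, and features are stored as plain nested lists.

```python
import random


def corrupt_by_row_shuffle(X, A, seed=None):
    """Return (X_tilde, A_tilde): rows of X are permuted, the adjacency matrix is kept."""
    rng = random.Random(seed)
    perm = list(range(len(X)))
    rng.shuffle(perm)                        # random reassignment of feature rows to nodes
    X_tilde = [list(X[p]) for p in perm]     # node i now carries the features of node perm[i]
    A_tilde = [list(row) for row in A]       # graph structure is left unchanged
    return X_tilde, A_tilde
```

Because only the assignment of features to nodes changes, the corrupted graph has the
same feature multiset and the same structure, but node features no longer agree with
their neighborhoods.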
Once positive and negative samples are collected, one can learn the representations
by maximizing the objective in Eq. (4.42). We summarize the training process
of Deep Graph Infomax as follows:


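1. Sample negative examples via the corruption function $(\tilde{X}, \tilde{A}) \sim \mathcal{C}(X, A)$.
2. Compute node representations of the positive examples via the encoder, $H = \mathcal{E}(X, A)$.
3. Compute node representations of the negative examples via the encoder, $\tilde{H} = \mathcal{E}(\tilde{X}, \tilde{A})$.
4. Compute graph representation via the readout function $s = \mathcal{R}(H)$.
5. Update parameters of $\mathcal{E}$, $\mathcal{D}$, and $\mathcal{R}$ via gradient ascent to maximize Eq. (4.42).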
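The forward computation behind this training procedure can be sketched as follows. The
encoder, readout, and discriminator below are deliberately simple, parameter-free
stand-ins that we assume for illustration (they are not the architectures used by the
original authors), and the gradient-ascent update is omitted because it would require
an automatic differentiation library.

```python
import math
import random


def encoder(X, A):
    """Assumed one-layer encoder: average a node's own and its neighbours' features."""
    H = []
    for v, row in enumerate(A):
        nbrs = [u for u, a in enumerate(row) if a] + [v]
        H.append([sum(X[u][k] for u in nbrs) / len(nbrs) for k in range(len(X[0]))])
    return H


def readout(H):
    """Assumed readout: sigmoid of the mean node representation."""
    n = len(H)
    return [1.0 / (1.0 + math.exp(-sum(h[k] for h in H) / n)) for k in range(len(H[0]))]


def discriminator(h, s):
    """Assumed discriminator: sigmoid of the inner product of h and s."""
    return 1.0 / (1.0 + math.exp(-sum(a * b for a, b in zip(h, s))))


def dgi_forward(X, A, rng):
    perm = list(range(len(X)))
    rng.shuffle(perm)
    X_tilde = [X[p] for p in perm]       # corrupt: shuffle feature rows, keep A
    H = encoder(X, A)                    # node representations of the original graph
    H_tilde = encoder(X_tilde, A)        # node representations of the corrupted graph
    s = readout(H)                       # graph-level summary
    pos = sum(math.log(discriminator(h, s)) for h in H)
    neg = sum(math.log(1.0 - discriminator(h, s)) for h in H_tilde)
    return (pos + neg) / (len(H) + len(H_tilde))   # value of Eq. (4.42)
```

In a real implementation the three components would carry learnable parameters, and
the returned value would be maximized with respect to them.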
4.3.2.3 Discussion


Deep Graph Infomax is an efficient unsupervised representation learning method 
for graph-structured data. The implementation of the encoder, the readout, and the 
binary cross-entropy type of loss are all straightforward. The mini-batch training 
does not necessarily need to store the whole graph since the readout can be ap- 
plied to a set of subgraphs as well. Therefore, the method is memory-efficient. Also, 
the processing of positive and negative samples can be done in parallel. Moreover, 
the authors prove that minimizing the cross-entropy type of classification error can be
used to maximize the mutual information under certain conditions, e.g., the readout
function is injective and the input features come from a finite set. However, the choice
of the corruption function seems to be crucial for satisfactory empirical performance.
There seems to be no universally good corruption function; one needs trial and error to
find a suitable one for the task and dataset at hand.


4.4 Over-smoothing Problem 


Training deep graph neural networks by stacking multiple layers of graph neural
networks usually yields inferior results, which is a common problem observed in
many different graph neural network architectures. This is mainly due to the problem
of over-smoothing, which was first explicitly studied in (Li et al, 2018b). Li et al
(2018b) showed that the graph convolutional network (Kipf and Welling, 2017b) is
a special case of Laplacian smoothing:


$$Y = (1 - \gamma) X + \gamma \tilde{A}_{rw} X, \qquad (4.43)$$


where $\tilde{A}_{rw} = \tilde{D}^{-1} \tilde{A}$, which defines the transition
probabilities between nodes on graphs. The GCN corresponds to a special case of Laplacian
smoothing with $\gamma = 1$ in which the symmetric matrix
$\tilde{A}_{sym} = \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2}$ is used. Laplacian
smoothing pushes nodes belonging to the same cluster to take similar representations,
which is beneficial for downstream tasks such as node classification. However, when
GCNs go deep, the node representations suffer from the problem of over-smoothing,
i.e., all the nodes will have similar representations. As a result, the performance on
downstream tasks suffers as well. This phenomenon was later pointed out by several
other works as well, such as (Zhao and Akoglu, 2019; Li et al, 2018b; Xu
et al, 2018a; Li et al, 2019c; Rong et al, 2020b).
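The over-smoothing effect is easy to observe numerically. The sketch below repeatedly
applies the smoothing step of Eq. (4.43) on a small path graph and records how far
apart the node representations remain. We assume, following the usual GCN convention,
that the tilde denotes the graph with self-loops added; the helper names and the toy
graph are ours.

```python
def random_walk_smooth(X, A, gamma=1.0):
    """One step of Eq. (4.43), with self-loops added to A before row-normalising."""
    n, d = len(X), len(X[0])
    Y = []
    for v in range(n):
        nbrs = [u for u in range(n) if A[v][u] or u == v]      # neighbours plus v itself
        mean = [sum(X[u][k] for u in nbrs) / len(nbrs) for k in range(d)]
        Y.append([(1 - gamma) * X[v][k] + gamma * mean[k] for k in range(d)])
    return Y


def spread(X):
    """Largest coordinate-wise gap between any two node representations."""
    return max(max(col) - min(col) for col in zip(*X))


# Path graph 0-1-2-3 with scalar node features.
A = [[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]]
X = [[1.0], [0.0], [0.0], [-1.0]]
spreads = [spread(X)]
for _ in range(50):
    X = random_walk_smooth(X, A)
    spreads.append(spread(X))
```

The recorded spread never increases and becomes negligible after a few dozen steps: all
nodes end up with nearly the same representation, which is the behavior described
above for deep GCNs.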


PairNorm (Zhao and Akoglu, 2019). Next, we will present a method called
PairNorm for alleviating the problem of over-smoothing when GNNs go deep. The
essential idea of PairNorm is to keep the total pairwise squared distance (TPSD)
of node representations unchanged, i.e., the same as that of the original node
features $X$. Let $\tilde{H}$ be the node representations output by the graph
convolution, which will be the input of PairNorm, and let $\hat{H}$ be the output of
PairNorm. The goal of PairNorm is to normalize $\tilde{H}$ such that after normalization
$\mathrm{TPSD}(\hat{H}) = \mathrm{TPSD}(X)$. In other words,


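$$\sum_{(i,j) \in \mathcal{E}} \| \hat{H}_i - \hat{H}_j \|^2 + \sum_{(i,j) \notin \mathcal{E}} \| \hat{H}_i - \hat{H}_j \|^2 = \sum_{(i,j) \in \mathcal{E}} \| X_i - X_j \|^2 + \sum_{(i,j) \notin \mathcal{E}} \| X_i - X_j \|^2. \qquad (4.44)$$
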
In practice, instead of measuring the TPSD of the original node features $X$, Zhao
and Akoglu (2019) proposed to maintain a constant TPSD value $C$ across different
graph convolutional layers. The value $C$ is a hyperparameter of the PairNorm
layer, which can be tuned for each data set. To normalize $\tilde{H}$ into $\hat{H}$ with
a constant TPSD, we must first calculate $\mathrm{TPSD}(\tilde{H})$. However, computing it
directly is expensive, since the cost is quadratic in the number of nodes $N$. We notice
that the TPSD can be reformulated as:


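$$\mathrm{TPSD}(\tilde{H}) = \sum_{i,j \in [N]} \| \tilde{H}_i - \tilde{H}_j \|_2^2 = 2N^2 \left( \frac{1}{N} \sum_{i=1}^{N} \| \tilde{H}_i \|_2^2 - \Big\| \frac{1}{N} \sum_{i=1}^{N} \tilde{H}_i \Big\|_2^2 \right). \qquad (4.45)$$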


We can further simplify the above equation by subtracting the row-wise mean
from each $\tilde{H}_i$. In other words,
$\tilde{H}_i^c = \tilde{H}_i - \frac{1}{N} \sum_{j=1}^{N} \tilde{H}_j$, which denotes the
centered representation. A nice property of centering the node representations is that
it does not change the TPSD and meanwhile pushes the second term
$\| \frac{1}{N} \sum_{i=1}^{N} \tilde{H}_i^c \|_2^2$ to zero. As a result, we have


$$\mathrm{TPSD}(\tilde{H}) = \mathrm{TPSD}(\tilde{H}^c) = 2N \| \tilde{H}^c \|_F^2. \qquad (4.46)$$
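The identity in Eq. (4.46) can be verified numerically by comparing the quadratic-time
definition of the TPSD (summing over all ordered pairs of nodes) with the linear-time
centered formula. The helper names and the toy matrix below are our own.

```python
def tpsd_bruteforce(H):
    """Sum of squared distances over all ordered pairs (i, j); quadratic in N."""
    return sum(
        sum((a - b) ** 2 for a, b in zip(h_i, h_j))
        for h_i in H
        for h_j in H
    )


def tpsd_centered(H):
    """2N times the squared Frobenius norm of the centered matrix; linear in N."""
    n = len(H)
    mean = [sum(col) / n for col in zip(*H)]
    return 2 * n * sum((x - m) ** 2 for h in H for x, m in zip(h, mean))


H = [[1.0, 2.0], [3.0, -1.0], [0.5, 0.5], [-2.0, 4.0]]
```

Both functions return the same value up to floating-point rounding, so the centered
form can replace the pairwise sum whenever N is large.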


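To summarize, the proposed PairNorm can be divided into two steps, center and scale:


$$\tilde{H}_i^c = \tilde{H}_i - \frac{1}{N} \sum_{j=1}^{N} \tilde{H}_j \quad \text{(Center)}, \qquad (4.47)$$


$$\hat{H}_i = s \cdot \frac{\tilde{H}_i^c}{\sqrt{\frac{1}{N} \sum_{j=1}^{N} \| \tilde{H}_j^c \|_2^2}} = s \sqrt{N} \cdot \frac{\tilde{H}_i^c}{\sqrt{\| \tilde{H}^c \|_F^2}} \quad \text{(Scale)}, \qquad (4.48)$$

where $s$ is a hyperparameter determining $C$. At the end, we have


$$\mathrm{TPSD}(\hat{H}) = 2N \| \hat{H} \|_F^2 = 2N \sum_{i=1}^{N} \Big\| s \cdot \frac{\tilde{H}_i^c}{\sqrt{\frac{1}{N} \sum_{j=1}^{N} \| \tilde{H}_j^c \|_2^2}} \Big\|_2^2 = 2N^2 s^2, \qquad (4.49)$$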


which is a constant across different graph convolutional layers. 
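The center-and-scale procedure fits in a few lines. The sketch below is a plain-Python
illustration rather than a reference implementation: the function name `pairnorm` is
ours, and it assumes that the rows of the input are not all identical (otherwise the
scale step divides by zero; practical implementations typically add a small constant
to the denominator).

```python
import math


def pairnorm(H, s=1.0):
    """Center the rows of H, then rescale so that the TPSD equals 2 * N^2 * s^2."""
    n = len(H)
    mean = [sum(col) / n for col in zip(*H)]
    Hc = [[x - m for x, m in zip(h, mean)] for h in H]        # center step
    rms = math.sqrt(sum(x * x for h in Hc for x in h) / n)    # sqrt of mean squared row norm
    return [[s * x / rms for x in h] for h in Hc]             # scale step
```

Whatever the input scale, the output has zero column means and a total pairwise
squared distance fixed by s alone, which is what keeps deep layers from collapsing all
node representations to a single point.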


4.5 Summary 


In this chapter, we give a comprehensive introduction to different architectures 
of graph neural networks for node classification. These neural networks can be 
generally classified into two categories: supervised and unsupervised approaches.
For supervised approaches, the main difference among architectures lies in how to
propagate messages between nodes, how to aggregate the messages from neighbors, and
how to combine the aggregated messages from neighbors with the node representation
itself. For the unsupervised approaches, the main difference comes from designing the
objective function. We also discuss a common
problem of training deep graph neural networks—over-smoothing, and introduce a 
method to tackle it. In the future, promising directions on graph neural networks in- 
clude theoretical analysis for understanding the behaviors of graph neural networks, 
and applying them to a variety of fields and domains such as recommender sys- 
tems, knowledge graphs, drug and material discovery, computer vision, and natural 
language understanding. 


Editor’s Notes: Node classification is one of the most important tasks
in Graph Neural Networks. The node representation learning techniques
introduced in this chapter are the cornerstone for all other tasks in the rest
of the book, including graph classification (Chapter 9), link prediction
(Chapter 10), graph generation (Chapter 11), and so on. Familiarity
with the learning methodologies and design principles of node representation
learning is the key to deeply understanding other fundamental research
directions like theoretical analysis (Chapter 5), scalability (Chapter 6),
explainability (Chapter 7), and adversarial robustness (Chapter 8).


Chapter 5
The Expressive Power of Graph Neural Networks


Pan Li and Jure Leskovec 


Abstract The success of neural networks is based on their strong expressive power 
that allows them to approximate complex non-linear mappings from features to 
predictions. Since the universal approximation theorem by (Cybenko, 1989), many 
studies have proved that feed-forward neural networks can approximate any func- 
tion of interest. However, these results have not been applied to graph neural net- 
works (GNNs) due to the inductive bias imposed by additional constraints on the 
GNN parameter space. New theoretical studies are needed to better understand these 
constraints and characterize the expressive power of GNNs. 

In this chapter, we will review the recent progress on the expressive power of GNNs 
in graph representation learning. We will start by introducing the most widely-used 
GNN framework— message passing— and analyze its power and limitations. We 
will next introduce some recently proposed techniques to overcome these limita- 
tions, such as injecting random attributes, injecting deterministic distance attributes, 
and building higher-order GNNs. We will present the key insights of these tech- 
niques and highlight their advantages and disadvantages. 


5.1 Introduction 


Machine learning problems can be abstracted as learning a mapping $f^*$ from some
feature space to some target space. The solution to this problem is typically given
by a model $f_\theta$ that intends to approximate $f^*$ by optimizing some parameter
$\theta$. In practice, the ground truth $f^*$ is typically unknown a priori. Therefore,
one may expect the model $f_\theta$ to approximate a rather broad range of $f^*$. An
estimate of how broad such a range could be, called the model’s expressive power,
provides an important measure of the model's potential. It is desirable to have models
with more expressive power that may learn more complex mapping functions.

Neural networks (NNs) are well known for their great expressive power. Specifically,
Cybenko (1989) first proved that any continuous function defined over a compact
space could be uniformly approximated by neural networks with sigmoid activation
functions and only one hidden layer. Later, this result was generalized to any
squashing activation function by (Hornik et al, 1989).

However, these seminal findings are insufficient to explain the current unprecedented
success of NNs in practice because their strong expressive power only demonstrates
that the model $f_\theta$ is able to approximate $f^*$ but does not guarantee that the
model obtained via training, $\hat{f}$, indeed approximates $f^*$. Fig. 5.1 illustrates a
well-known curve of Amount of Data vs. Performance of machine learning models (Ng,
2011). NN-based methods may only outperform traditional methods given sufficient
data. One important reason is that NNs as machine learning models are still governed
by the fundamental tradeoff between the data amount and model complexity (Fig. 5.2).
Although NNs could be rather expressive, they are likely to overfit the training
examples when paired with more parameters. Therefore, it is necessary in practice to
build NNs that can maintain strong expressive power while constraints are imposed on
their parameters. At the same time, a good theoretical understanding of the expressive
power of NNs with constraints on their parameters is needed.

[Figure: performance vs. amount of data; one curve is labeled "Traditional machine learning: SVM, GBDT".]
Fig. 5.1: Amount of Data vs. Performance of different models.


[Figure: training and testing error vs. model complexity, without inductive bias (solid)
and with inductive bias (dashed). Annotations: "Optimal model complexity"; "Naively
improving the expressive power by increasing model complexity"; "Improving the expressive
power by injecting inductive bias into the model while keeping model complexity".]

Fig. 5.2: Training and testing errors with and without inductive bias can dramatically
affect the expressive power of models.


In practice, constraints on parameters are typically obtained from our prior
knowledge of the data; these are referred to as inductive biases. Some significant
results about the expressive power of NNs with inductive bias have been shown
recently. Yarotsky (2017); Liang and Srikant (2017) have proved that deep neural
networks (DNNs), by stacking multiple hidden layers, can achieve good enough
approximation with significantly fewer parameters than shallow NNs. The architecture
of DNNs leverages the fact that data typically has a hierarchical structure.
DNNs are agnostic to the type of data, while dedicated neural network architectures
have been developed to support specific types of data. Recurrent neural networks
(RNNs) (Hochreiter and Schmidhuber, 1997) and convolutional neural networks
(CNNs) (LeCun et al, 1989) were proposed to process time series and images,
respectively. In these two types of data, effective patterns typically hold translation
invariance in time and in space, respectively. To match this invariance, RNNs and
CNNs adopt the inductive bias that their parameters are shared across time and
space (Fig. 5.3). The parameter-sharing mechanism works as a constraint on the
parameters and limits the expressive power of RNNs and CNNs. However, RNNs
and CNNs have been shown to have sufficient expressive power to learn translation
invariant functions (Siegelmann and Sontag, 1995; Cohen and Shashua, 2016;
Khrulkov et al, 2018), which leads to the great practical success of RNNs and CNNs
in processing time series and images.

Fig. 5.3: Illustration of 1-dimensional translation invariance/variance. RNNs/CNNs
use translation invariance to share parameters.

Recently, many studies have focused on a new type of NNs, termed graph neu- 
ral networks (GNNs) (Scarselli et al, 2008; Bruna et al, 2014; Kipf and Welling, 
2017a; Bronstein et al, 2017; Gilmer et al, 2017; Hamilton et al, 2017b; Battaglia 
et al, 2018). These aim to capture the inductive bias of graphs/networks, another 
important type of data. Graphs are commonly used to model complex relations and 
interactions between multiple elements and have been widely used in machine learn- 
ing applications, such as community detection, recommendation systems, molecule 
property prediction, and medicine design (Fortunato, 2010; Fouss et al, 2007; Pires 
et al, 2015). Compared to time series and images, which are well-structured and rep- 
resented by tables or grids, graphs are irregular and thus introduce new challenges. 
A fundamental assumption behind machine learning on graphs is that the targets 
for prediction should be invariant to the order of nodes of the graph. To match this 
assumption, GNNs hold a general inductive bias termed permutation invariance. In
particular, the output given by GNNs should be independent of how the node indices
of a graph are assigned and thus of the order in which they are processed. GNNs require
their parameters to be independent of the node ordering and to be shared across the
entire graph (Fig. 5.4). Because of this new parameter sharing mechanism in GNNs,
new theoretical tools are needed to characterize their expressive power.

[Figure: panel "Permutation Invariance": GNNs are built to match permutation invariance
between features and targets.]
Fig. 5.4: This illustrates how GNNs are designed to maintain permutation invariance.

Analyzing the expressive power of GNNs is challenging, as this problem is 
closely related to some long-standing problems in graph theory. To understand this 
connection, consider the following example of how a GNN would predict whether a 
graph structure corresponds to a valid molecule. The GNN should be able to identify 
whether this graph structure is the same, similar, or very different from the graph 
structures that are known to correspond to valid molecules. Measuring whether two
graphs have the same structure involves addressing the graph isomorphism problem,
for which no polynomial-time solution has yet been found (Helfgott et al, 2017). In
addition, measuring whether two graphs have a similar structure requires contending
with the graph edit distance problem, which is even harder to address than the graph
isomorphism problem (Lewis et al, 1983).

Great progress has been made recently on characterizing the expressive power of 
GNNs, especially on how to match their power with traditional graph algorithms and 
how to build more powerful GNNs that overcome the limitation of those algorithms. 
We will delve more into these recent efforts further along in this chapter. In par- 
ticular, compared to previous introductions (Hamilton, 2020; Sato, 2020), we will 
focus on recent key insights and techniques that yield more powerful GNNs. Specifi- 
cally, we will introduce standard message-passing GNNs that are able to achieve the 
limit of the 1-dimensional Weisfeiler-Lehman test (Weisfeiler and Leman, 1968), a 
widely-used algorithm to test for graph isomorphism. We will also discuss a number 
of strategies to overcome the limitations of the Weisfeiler-Lehman test — including 
attaching random attributes, attaching deterministic distance attributes, and leverag- 
ing higher-order structures. 

In Section 5.2, we will formulate the graph representation learning problems that 
GNNs target. In Section 5.3, we will review the most widely used GNN frame- 
work, the message passing neural network, describing the limitations of its expres- 
sive power and discussing its efficient implementations. In Section 5.4, we will in- 
troduce a number of methods that make GNNs more powerful than the message 
passing neural network. In Section 5.5, we will conclude this chapter by discussing 
further research directions.

Fig. 5.5: An illustration of the expressive power of NNs and GNNs and their effects
on the performance of learned models. a) Machine learning problems aim to learn
the mapping from the feature space to the target space based on several observed
examples. b) The expressive power of NNs refers to the gap between the two spaces
$\mathcal{F}$ and $\mathcal{F}'$. Although NNs are expressive ($\mathcal{F}'$ is dense in
$\mathcal{F}$), the learned model $f'$ based on NNs may differ significantly from $f^*$
due to overfitting the limited observed data. c) Suppose $f^*$ is known to be permutation
invariant, as it works on graph-structured data. Then, the space of potential mapping
functions is reduced from $\mathcal{F}$ to a much smaller space $\mathcal{F}_{p\text{-}inv}$
that only includes permutation invariant functions. If we adopt GNNs, the space of
mapping functions that can be approximated simultaneously reduces to
$\mathcal{F}'_{p\text{-}inv}$. The gap between $\mathcal{F}_{p\text{-}inv}$ and
$\mathcal{F}'_{p\text{-}inv}$ characterizes the expressive power of GNNs. d) Although GNNs
are less expressive than general NNs ($\mathcal{F}'_{p\text{-}inv} \subset \mathcal{F}'$),
the learned model $f$ based on GNNs is a much better approximator of $f^*$ than the one
based on NNs, $f'$. Therefore, for graph-structured data, our understanding of the
expressive power of GNNs, i.e., the gap between $\mathcal{F}_{p\text{-}inv}$ and
$\mathcal{F}'_{p\text{-}inv}$, is much more relevant than that of NNs.


5.2 Graph Representation Learning and Problem Formulation 


In this section, we will set up the formal definition of graph representation learning 
problems, their fundamental assumption, and their inductive bias. We will also dis- 
cuss relationships between different notions of graph representation learning prob- 
lems frequently studied in recent literature. 

First, we will start by defining graph-structured data. 


Definition 5.1. (Graph-structured data) Let $\mathcal{G} = (\mathcal{V}, \mathcal{E}, X)$
denote an attributed graph, where $\mathcal{V}$ is the node set, $\mathcal{E}$ is the edge
set, and $X \in \mathbb{R}^{|\mathcal{V}| \times F}$ are the node attributes. Each row of
$X$, $X_v \in \mathbb{R}^F$, refers to the attributes on the node $v \in \mathcal{V}$. In
practice, graphs are usually sparse, i.e., $|\mathcal{E}| \ll |\mathcal{V}|^2$. We introduce
$A \in \{0, 1\}^{|\mathcal{V}| \times |\mathcal{V}|}$ to denote the adjacency matrix of
$\mathcal{G}$ such that $A_{uv} = 1$ iff $(u, v) \in \mathcal{E}$. Combining the
adjacency matrix and node attributes, we may also denote $\mathcal{G} = (A, X)$. Moreover,
if $\mathcal{G}$ is unattributed with no node attributes, we can assume that all elements
in $X$ are constant. Later, we also use $\mathcal{V}[\mathcal{G}]$ to denote the entire
node set of a particular graph $\mathcal{G}$.

Fig. 5.6: Graph representation learning problems frequently discussed in literature.


The goal of graph representation learning is to learn a model by taking graph-structured
data as input and then mapping it so that certain prediction targets are
matched. Different graph representation learning problems may apply to a varying
number of nodes in a graph. For example, in node classification a prediction is
made for each node; in link/relation prediction, for each pair of nodes; and in
graph classification or graph property prediction, for the entire node set
$\mathcal{V}$. We can unify all these problems as graph representation learning.


Definition 5.2. (Graph representation learning) The feature space is defined as
$\mathcal{X} := \Gamma \times \mathcal{S}$, where $\Gamma$ is the space of graph-structured
data and $\mathcal{S}$ includes all the node subsets of interest, given a graph
$\mathcal{G} \in \Gamma$. Then, a point in $\mathcal{X}$ can be denoted as
$(\mathcal{G}, S)$, where $S$ is a subset of nodes that are in $\mathcal{G}$. Later, we
call $(\mathcal{G}, S)$ a graph representation learning (GRL) example. Each GRL example
$(\mathcal{G}, S) \in \mathcal{X}$ is associated with a target $y$ in the target space
$\mathcal{Y}$. Suppose the ground-truth association function between features and targets
is denoted by $f^*: \mathcal{X} \to \mathcal{Y}$, $f^*(\mathcal{G}, S) = y$. Given a set of
training examples $\Xi = \{(\mathcal{G}^{(i)}, S^{(i)}, y^{(i)})\}_{i=1}^{k}$ and a set of
testing examples $\Psi = \{(\tilde{\mathcal{G}}^{(i)}, \tilde{S}^{(i)}, \tilde{y}^{(i)})\}_{i=1}^{k'}$,
a graph representation learning problem is to learn a function $f$ based on $\Xi$ such
that $f$ is close to $f^*$ on $\Psi$.


The above definition is general in the sense that in a GRL example
$(\mathcal{G}, S) \in \mathcal{X}$, $\mathcal{G}$ provides both raw and structural features
on which some prediction for a node subset $S$ of interest is to be made. Below, we will
further list a few frequently investigated learning problems that may be formulated as
graph representation learning problems.


Remark 5.1. (Graph classification problem / Graph-level prediction) The node set $S$
of interest is the entire node set $\mathcal{V}[\mathcal{G}]$ by default. The space of
graph-structured data $\Gamma$ typically contains multiple graphs. The target space
$\mathcal{Y}$ contains labels of different graphs. Later, for graph-level prediction, we
will use $\mathcal{G}$ to denote a GRL example instead of $(\mathcal{G}, S)$ for notational
simplicity.


Remark 5.2. (Node classification problem / Node-level prediction) In a GRL example
$(\mathcal{G}, S)$, $S$ corresponds to one single node of interest. The corresponding
$\mathcal{G}$ can be defined in different ways. On the one hand, only the nodes close to
$S$ provide effective features. In this case, $\mathcal{G}$ may be set as the induced
local subgraph around $S$. Different $\mathcal{G}$'s for different $S$'s may come from a
single graph. On the other hand, two nodes that are far apart on one graph still hold
mutual impact and can be used as a feature to make a prediction on another graph. In
that case, $\mathcal{G}$ needs to include a large portion of a graph or even the entire
graph.


Remark 5.3. (Link prediction problem / Node-pair-level prediction) In a GRL example
$(\mathcal{G}, S)$, $S$ corresponds to a pair of nodes of interest. Similar to the node
classification problem, $\mathcal{G}$ for each example may be an induced subgraph around
$S$ or the entire graph. The target space $\mathcal{Y}$ contains 0-1 labels that indicate
whether there is a probable link between two nodes. $\mathcal{Y}$ may also be generalized
to include labels that reflect the types of links to be predicted.
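The three remarks differ only in how the node subset S is chosen, which suggests a
single data structure for all GRL examples. The sketch below is one possible encoding;
the class and function names (`Graph`, `GRLExample`, `graph_level`, `node_level`,
`link_level`) are hypothetical and nodes are assumed to be indexed 0, ..., |V|-1.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class Graph:
    adjacency: List[List[int]]     # A, a |V| x |V| matrix with 0/1 entries
    features: List[List[float]]    # X, a |V| x F matrix of node attributes

    def nodes(self) -> FrozenSet[int]:
        return frozenset(range(len(self.adjacency)))


@dataclass(frozen=True)
class GRLExample:
    graph: Graph
    subset: FrozenSet[int]         # S, the node subset of interest


def graph_level(g: Graph) -> GRLExample:
    return GRLExample(g, g.nodes())              # S is the entire node set


def node_level(g: Graph, v: int) -> GRLExample:
    return GRLExample(g, frozenset({v}))         # S is a single node


def link_level(g: Graph, u: int, v: int) -> GRLExample:
    return GRLExample(g, frozenset({u, v}))      # S is a pair of nodes
```

With this encoding, a model for any of the three tasks has the same signature: it
receives one `GRLExample` and returns a prediction in the target space.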


Next, we will introduce the fundamental assumption used in most graph repre- 
sentation learning problems. 


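Definition 5.3. (Isomorphism) Consider two GRL examples
$(\mathcal{G}^{(1)}, S^{(1)}), (\mathcal{G}^{(2)}, S^{(2)}) \in \mathcal{X}$. Suppose
$\mathcal{G}^{(1)} = (A^{(1)}, X^{(1)})$ and $\mathcal{G}^{(2)} = (A^{(2)}, X^{(2)})$. If
there exists a bijective mapping
$\pi: \mathcal{V}[\mathcal{G}^{(1)}] \to \mathcal{V}[\mathcal{G}^{(2)}]$ such that
$A^{(1)}_{uv} = A^{(2)}_{\pi(u)\pi(v)}$, $X^{(1)}_u = X^{(2)}_{\pi(u)}$, and $\pi$ also gives
a bijective mapping between $S^{(1)}$ and $S^{(2)}$, we say that
$(\mathcal{G}^{(1)}, S^{(1)})$ and $(\mathcal{G}^{(2)}, S^{(2)})$ are isomorphic, denoted as
$(\mathcal{G}^{(1)}, S^{(1)}) \cong (\mathcal{G}^{(2)}, S^{(2)})$. When the particular
bijective mapping $\pi$ should be highlighted, we use the notation
$(\mathcal{G}^{(1)}, S^{(1)}) \overset{\pi}{\cong} (\mathcal{G}^{(2)}, S^{(2)})$. If there
is no such $\pi$, we say that they are non-isomorphic, denoted as
$(\mathcal{G}^{(1)}, S^{(1)}) \not\cong (\mathcal{G}^{(2)}, S^{(2)})$.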
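For very small graphs, Definition 5.3 can be checked by brute force over all bijections.
The sketch below does exactly that; it is meant only to make the three conditions
explicit (structure, attributes, and the node subset must all be preserved) and is
exponential in the number of nodes. The function name and the toy graphs are ours.

```python
from itertools import permutations


def are_isomorphic(A1, X1, S1, A2, X2, S2):
    """Brute-force check of Definition 5.3; only feasible for a handful of nodes."""
    n = len(A1)
    if n != len(A2) or len(S1) != len(S2):
        return False
    for pi in permutations(range(n)):                 # pi[u] is the image of node u
        if {pi[u] for u in S1} != set(S2):            # pi must map S1 onto S2
            continue
        if any(X1[u] != X2[pi[u]] for u in range(n)):  # attributes must match
            continue
        if all(A1[u][v] == A2[pi[u]][pi[v]] for u in range(n) for v in range(n)):
            return True                               # adjacency must match
    return False


# The same 3-node path under two labelings: centre is node 1 in the first, node 0 in the second.
path_a = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
path_b = [[0, 1, 1], [1, 0, 0], [1, 0, 0]]
feats = [1, 1, 1]
```

On this example, choosing the centre node in both graphs gives isomorphic GRL examples,
whereas pairing the centre of one with an end node of the other does not, even though
the underlying graphs are the same.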


Assumption 1 (Fundamental assumption in graph representation learning) Consider
a graph representation learning problem with a feature space $\mathcal{X}$ and its
corresponding target space $\mathcal{Y}$. Pick any two GRL examples
$(\mathcal{G}^{(1)}, S^{(1)}), (\mathcal{G}^{(2)}, S^{(2)}) \in \mathcal{X}$. The
fundamental assumption says that if
$(\mathcal{G}^{(1)}, S^{(1)}) \cong (\mathcal{G}^{(2)}, S^{(2)})$, their corresponding
targets in $\mathcal{Y}$ are the same.


Due to this fundamental assumption, it is natural to introduce the corresponding 
permutation invariance as inductive bias that all models of graph representation 
learning should satisfy. 


Definition 5.4. (Permutation invariance) A model $f$ satisfies permutation invariance
if for any $(\mathcal{G}^{(1)}, S^{(1)}) \cong (\mathcal{G}^{(2)}, S^{(2)})$,
$f(\mathcal{G}^{(1)}, S^{(1)}) = f(\mathcal{G}^{(2)}, S^{(2)})$.
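Definition 5.4 can be probed empirically by relabeling the nodes of a GRL example and
comparing model outputs. The sketch below contrasts a toy model that satisfies the
definition with one that does not; both models, the helper `permute_example`, and the
integer node features are illustrative assumptions of ours.

```python
def permute_example(A, X, S, pi):
    """Relabel node u as pi[u]; the result is isomorphic to the input by construction."""
    n = len(A)
    A2 = [[0] * n for _ in range(n)]
    X2 = [None] * n
    for u in range(n):
        X2[pi[u]] = X[u]
        for v in range(n):
            A2[pi[u]][pi[v]] = A[u][v]
    return A2, X2, {pi[u] for u in S}


def invariant_model(A, X, S):
    """Sum, over nodes in S, of the node's feature plus its neighbours' features."""
    return sum(X[v] + sum(X[u] for u in range(len(A)) if A[v][u]) for v in S)


def index_dependent_model(A, X, S):
    """Weights each node by its index, so relabeling can change the output."""
    return sum(v * X[v] for v in S)
```

A single failing relabeling is enough to show that a model violates permutation
invariance, while passing such checks is only evidence, not a proof, that it holds.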


Now we may define the expressive power of a model for graph representation
learning problems.

Definition 5.5. (Expressive power) Consider a feature space $\mathcal{X}$ of a graph
representation learning problem and a model $f$ defined on $\mathcal{X}$. Define another
space $\mathcal{X}(f)$ as a subspace of the quotient space $\mathcal{X}/\cong$ such that
for two GRL examples
$(\mathcal{G}^{(1)}, S^{(1)}), (\mathcal{G}^{(2)}, S^{(2)}) \in \mathcal{X}(f)$,
$f(\mathcal{G}^{(1)}, S^{(1)}) \neq f(\mathcal{G}^{(2)}, S^{(2)})$. Then, the size
of $\mathcal{X}(f)$ characterizes the expressive power of $f$. For two models, $f^{(1)}$
and $f^{(2)}$, if $\mathcal{X}(f^{(1)}) \supset \mathcal{X}(f^{(2)})$, we say that
$f^{(1)}$ is more expressive than $f^{(2)}$.


Remark 5.4. Note that the expressive power in Def. 5.5, characterized by how a
model can distinguish non-isomorphic GRL examples, does not exactly match the
traditional expressive power used for NNs in the sense of functional approximation.
Actually, Def. 5.5 is strictly weaker because distinguishing any non-isomorphic
GRL examples does not necessarily indicate that we can approximate any function
$f^*$ defined over $\mathcal{X}$. However, if a model $f$ cannot distinguish two non-isomorphic
features, f is definitely unable to approximate function f* that maps these two ex- 
amples to two different targets. Some recent studies have been able to prove some 
equivalence between distinguishing non-isomorphic features and permutation in- 
variant function approximations under weak assumptions and applying involved 
techniques (Chen et al, 2019f; Azizian and Lelarge, 2020). Interested readers may 
check these references for more details. 


It is trivial to provide the expressive power of a model f for graph representa- 
tion learning if f does not satisfy permutation invariance. Without such a constraint, 
NNs can approximate all continuous functions (Cybenko, 1989), which include the 
continuous functions that distinguish any non-isomorphic GRL examples. There- 
fore, the key question we are to discuss in the chapter is: “How to build the most 
expressive permutation invariant models, GNNs in particular, for graph representa- 
tion learning problems?” 


5.3 The Power of Message Passing Graph Neural Networks 


5.3.1 Preliminaries: Neural Networks for Sets 


We will start by reviewing NNs that take sets (multisets) as their input, since a set
can be viewed as a simplified version of a graph where all edges are removed. By
definition, the order of elements of a set does not impact the output; models that
encode sets naturally provide an important building block for encoding graphs.
We term this approach invariant pooling.


Definition 5.6. (Multiset) A multiset is a set where its elements can be repetitive, 
meaning that they are present multiple times. In this chapter, we assume by default 
that all the sets are multisets and thus allow repetitive elements. In situations where 
this is not the case, we will indicate otherwise.


Definition 5.7. (Invariant pooling) Given a multiset of vectors
$S = \{a_1, a_2, \ldots, a_k\}$ where $a_i \in \mathbb{R}^F$ and $F$ is an arbitrary
constant, an invariant pooling refers to a mapping, denoted as $q(S)$, that is invariant
to the order of elements in $S$.


Some widely-used invariant pooling operations include: sum pooling
$q(S) = \sum_{i=1}^{k} a_i$, mean pooling $q(S) = \frac{1}{k} \sum_{i=1}^{k} a_i$, and max
pooling $[q(S)]_j = \max_{i \in [1,k]} \{a_{ij}\}$ for all $j \in [1, F]$. Zaheer et al
(2017) show that any invariant pooling of a set $S$ can be approximated by
$q(S) = \phi(\sum_{i=1}^{k} \psi(a_i))$, where $\phi$ and $\psi$ are functions that may be
approximated by fully connected NNs, provided that $a_i$, $i \in [k]$, comes from a
countable universe. This statement can be generalized to the case where $S$ is a multiset
(Xu et al, 2019d).
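The pooling operations above are short enough to write out directly. The sketch below
also includes the sum-decomposition form phi(sum_i psi(a_i)), with `psi` and `phi`
passed in as ordinary Python callables standing in for neural networks; all function
names and the two toy multisets are ours.

```python
def sum_pool(S):
    return [sum(col) for col in zip(*S)]


def mean_pool(S):
    return [sum(col) / len(S) for col in zip(*S)]


def max_pool(S):
    return [max(col) for col in zip(*S)]


def decomposed_pool(S, psi, phi):
    """q(S) = phi(sum_i psi(a_i)); psi and phi stand in for learned networks."""
    return phi(sum_pool([psi(a) for a in S]))


# Two different multisets built from the same two vectors.
S1 = [[1.0, 0.0], [3.0, 2.0]]
S2 = [[1.0, 0.0], [1.0, 0.0], [3.0, 2.0], [3.0, 2.0]]
```

On these two multisets, mean and max pooling return identical outputs while sum pooling
does not, which illustrates why the choice of pooling matters once inputs are multisets
rather than sets.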


5.3.2 Message Passing Graph Neural Networks 


Message passing is the most widely-used framework to build GNNs (Gilmer et al,
2017). Given a graph $\mathcal{G} = (\mathcal{V}, \mathcal{E}, X)$, the message passing
framework encodes each node $v \in \mathcal{V}$ with a vector representation $h_v$ and
keeps updating this node representation by iteratively collecting representations of its
neighbors and applying neural network layers to perform a non-linear transformation of
those collections:


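1. Initialize node vector representations as node attributes: $h_v^{(0)} \leftarrow X_v$, $\forall v \in \mathcal{V}$.
2. Update each node representation based on message passing over the graph
structure. In the $l$-th layer, $l = 1, 2, \ldots, L$, perform the following steps:
Message: $m_{vu}^{(l)} \leftarrow \mathrm{MSG}(h_v^{(l-1)}, h_u^{(l-1)})$, $\forall (u, v) \in \mathcal{E}$, (5.1)
Aggregation: $a_v^{(l)} \leftarrow \mathrm{AGG}(\{m_{vu}^{(l)} \mid u \in \mathcal{N}_v\})$, $\forall v \in \mathcal{V}$, (5.2)
Update: $h_v^{(l)} \leftarrow \mathrm{UPT}(h_v^{(l-1)}, a_v^{(l)})$, $\forall v \in \mathcal{V}$, (5.3)
where $\mathcal{N}_v$ is the set of neighbors of $v$.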


The operations MSG, AGG, and UPT can be implemented via neural networks.
Typically, MSG is implemented by a feed-forward NN, e.g.,
$\mathrm{MSG}(p, q) = \sigma(p W_1 + q W_2)$, where $W_1$ and $W_2$ are learnable
weights, and $\sigma(\cdot)$ is an element-wise non-linear activation. UPT can be chosen
in a similar way as MSG. AGG differs as its input is a multiset of vectors and thus the
order of these vectors should not affect the output. AGG is typically implemented as an
invariant pooling (Def. 5.7). Each layer $l$ can have different parameters from other
layers. We will denote the GNNs that follow this message passing framework as MP-GNN.

[Figure: panel "MP-GNN to learn the node embedding of the node A"; each neural layer
applies MSG and AGG over the neighbors of a node.]
Fig. 5.7: The computing flow of MP-GNN to obtain a node representation.
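One layer of this framework can be sketched in plain Python as follows. The sketch
assumes tanh as the activation, sum pooling as AGG, and an UPT of the same form as MSG
with its own weights `W3` and `W4`; these choices, the adjacency-list input `neighbors`,
and all function names are illustrative assumptions rather than a prescribed design.

```python
import math


def matvec(p, W):
    """Row vector p (length d_in) times matrix W (d_in x d_out)."""
    return [sum(p[i] * W[i][j] for i in range(len(p))) for j in range(len(W[0]))]


def msg(h_v, h_u, W1, W2):
    """MSG(p, q) = sigma(p W1 + q W2), here with sigma = tanh."""
    return [math.tanh(a + b) for a, b in zip(matvec(h_v, W1), matvec(h_u, W2))]


def agg(messages, dim):
    """Sum pooling over the multiset of incoming messages (zeros if there are none)."""
    return [sum(m[k] for m in messages) for k in range(dim)]


def upt(h_v, a_v, W3, W4):
    """UPT chosen in the same form as MSG, with separate weights."""
    return [math.tanh(a + b) for a, b in zip(matvec(h_v, W3), matvec(a_v, W4))]


def mp_layer(H, neighbors, W1, W2, W3, W4):
    """Apply message, aggregation, and update once to every node."""
    msg_dim = len(W1[0])
    new_H = []
    for v, h_v in enumerate(H):
        messages = [msg(h_v, H[u], W1, W2) for u in neighbors[v]]
        new_H.append(upt(h_v, agg(messages, msg_dim), W3, W4))
    return new_H
```

Stacking L calls to `mp_layer`, each with its own weights, gives the L-layer model; since
AGG is a sum, the output of each node does not depend on the order in which its
neighbors are listed.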
MP-GNN produces representations of all the nodes, $\{h_v^{(L)} \mid v \in \mathcal{V}\}$.
Each node representation is essentially determined by a subtree rooted at this node
(Fig. 5.7). Given a specific graph representation learning problem, for example,
classifying a set of nodes $S \subseteq \mathcal{V}$, we may use the representations of
relevant nodes in $S$ to make the prediction:


$$\hat{y}_S = \mathrm{READOUT}(\{h_v^{(L)} \mid v \in S\}), \qquad (5.4)$$


where the READOUT operation is often implemented via another invariant pooling
when $|S| > 1$ plus a feed-forward NN to predict the target. Combining Eqs. (5.1)-(5.4),
MP-GNN builds a GNN model for graph representation learning:


$$\hat{y}_S = f_{\text{MP-GNN}}(\mathcal{G}, S). \qquad (5.5)$$


We can show the permutation invariance of MP-GNN by induction over the iteration
index $l$.


Theorem 5.1. (Invariance of MP-GNN) $f_{\text{MP-GNN}}(\cdot, \cdot)$ satisfies permutation
invariance (Def. 5.4) as long as the AGG and READOUT operations are invariant pooling
operations (Def. 5.7).


Proof. This can be proved trivially by induction. 


MP-GNN by default leverages the inductive bias that the nodes in the graph di- 
rectly affect each other only via their connected edges. The mutual effect between 
nodes that are not connected by an edge can be captured via paths that connect 
these nodes via message passing. Indeed, such inductive bias may not match the 
assumptions in a specific application, and MP-GNN may find it hard to capture mu- 
tual effect between two far-away nodes. However, the message-passing framework 
has several benefits for model implementation and practical deployment. First, it 
directly works on the original graph structure and no pre-processing is needed. Second,
graphs in practice are typically sparse ($|\mathcal{E}| \ll |\mathcal{V}|^2$) and thus
MP-GNN is able to scale to very large but sparse graphs. Third, each of the three
operations MSG,
AGG, and UPT can be computed in parallel across all nodes and edges, which is 
beneficial for parallel computing platforms such as GPUs and map-reduce systems. 

Because it is natural and easy to implement in practice, most GNN architectures
essentially follow the MP-GNN framework by adopting specific MSG, AGG,
and UPT operations. Representative approaches include InteractionNet (Battaglia 
et al, 2016), structure2vec (Dai et al, 2016), GCN (Kipf and Welling, 2017a), Graph- 
SAGE (Hamilton et al, 2017b), GAT (Veličković et al, 2018), GIN (Xu et al, 2019d), 
and many others (Kearnes et al, 2016; Zhang et al, 2018g). 


5.3.3 The Expressive Power of MP-GNN 


In this section, we will introduce the expressive power of MP-GNN, following the
results proposed in Xu et al (2019d); Morris et al (2019).


The 1-dimensional Weisfeiler-Lehman test to distinguish $(\mathcal{G}^{(1)}, S^{(1)})$ and $(\mathcal{G}^{(2)}, S^{(2)})$:


1. Assume each node $v$ in $\mathcal{V}[\mathcal{G}^{(i)}]$ is initialized with a color $c_v^{(i),0} \leftarrow X_v^{(i)}$ for $i = 1, 2$. If $X_v^{(i)}$
is a vector, then an injective function is used to map it to a color.
2. For $l = 1, 2, \ldots$, do


Update node colors: $c_v^{(i),l} \leftarrow \mathrm{HASH}(c_v^{(i),l-1}, \{c_u^{(i),l-1} \mid u \in \mathcal{N}_v^{(i)}\})$, $i \in \{1, 2\}$, (5.6)


where the HASH operation can be viewed as an injective mapping where different tuples
$(c_v^{(i),l-1}, \{c_u^{(i),l-1} \mid u \in \mathcal{N}_v^{(i)}\})$ are mapped to different labels.


Test: If two multisets $\{c_v^{(1),l} \mid v \in S^{(1)}\}$ and $\{c_v^{(2),l} \mid v \in S^{(2)}\}$ are not equal,
then return $(\mathcal{G}^{(1)}, S^{(1)}) \not\cong (\mathcal{G}^{(2)}, S^{(2)})$; else, go back to Eq. (5.6).


If the 1-WL test returns $(\mathcal{G}^{(1)}, S^{(1)}) \not\cong (\mathcal{G}^{(2)}, S^{(2)})$, we know that $(\mathcal{G}^{(1)}, S^{(1)})$ and $(\mathcal{G}^{(2)}, S^{(2)})$ are not
isomorphic. However, for some non-isomorphic $(\mathcal{G}^{(1)}, S^{(1)})$, $(\mathcal{G}^{(2)}, S^{(2)})$, the 1-WL test may
not return $(\mathcal{G}^{(1)}, S^{(1)}) \not\cong (\mathcal{G}^{(2)}, S^{(2)})$ (even with infinitely many iterations). In this case, the
1-WL test fails to distinguish them. Note that the 1-WL test was originally proposed to test the
isomorphism of two entire graphs, i.e., $S^{(i)} = \mathcal{V}[\mathcal{G}^{(i)}]$ for $i \in \{1, 2\}$ (Weisfeiler and Leman,
1968). Here the 1-WL test is further generalized to test the case where $S^{(i)} \subset \mathcal{V}[\mathcal{G}^{(i)}]$, to match
the general graph representation learning problems.
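
The procedure in the box can be sketched in a few lines of Python. This is an illustrative implementation under stated assumptions: graphs are given as adjacency dictionaries, HASH is realized by a shared lookup table that assigns a fresh integer to every new signature (so it is injective by construction), and the loop stops after a fixed number of rounds. The function and variable names are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def adjacency(n: int, edges) -> Dict[int, List[int]]:
    """Build an undirected adjacency dictionary on nodes 0..n-1."""
    adj = {v: [] for v in range(n)}
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    return adj


def wl_test(adj1, x1, S1, adj2, x2, S2, num_iters=None) -> bool:
    """Return True if 1-WL decides the two examples are non-isomorphic,
    and False if it fails to distinguish them within num_iters rounds."""
    table = {}  # shared injective HASH: signature -> integer color

    def hash_(signature):
        return table.setdefault(signature, len(table))

    c1 = {v: hash_(("init", x1[v])) for v in adj1}
    c2 = {v: hash_(("init", x2[v])) for v in adj2}
    if num_iters is None:
        num_iters = max(len(adj1), len(adj2))

    def differ() -> bool:
        # Compare the multisets of colors restricted to the node sets of interest.
        return Counter(c1[v] for v in S1) != Counter(c2[v] for v in S2)

    for _ in range(num_iters):
        if differ():
            return True
        # Eq. (5.6): new color = HASH(own color, multiset of neighbor colors).
        c1 = {v: hash_((c1[v], tuple(sorted(c1[u] for u in adj1[v])))) for v in adj1}
        c2 = {v: hash_((c2[v], tuple(sorted(c2[u] for u in adj2[v])))) for v in adj2}
    return differ()


def uniform(n: int) -> Dict[int, int]:
    """All nodes share the same attribute (no node features)."""
    return {v: 0 for v in range(n)}


path4 = adjacency(4, [(0, 1), (1, 2), (2, 3)])
star4 = adjacency(4, [(0, 1), (0, 2), (0, 3)])
two_triangles = adjacency(6, [(0, 1), (1, 2), (2, 0), (3, 4), (4, 5), (5, 3)])
hexagon = adjacency(6, [(i, (i + 1) % 6) for i in range(6)])

separates_path_star = wl_test(path4, uniform(4), set(range(4)),
                              star4, uniform(4), set(range(4)))
separates_cycles = wl_test(two_triangles, uniform(6), set(range(6)),
                           hexagon, uniform(6), set(range(6)))
```

The path and the star are separated after one refinement round because their degree multisets differ. Two disjoint triangles and a 6-cycle are not isomorphic, yet every node keeps the same color in every round, so the test never separates them.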


The expressive power we defined (Def. 5.5) is closely related to the graph iso- 
morphism problem. This problem is challenging, as no polynomial-time algorithms 
have been found for it (Garey, 1979; Garey and Johnson, 2002; Babai, 2016). De- 
spite some corner cases (Cai et al, 1992), the Weisfeiler-Lehman (WL) tests of graph 
isomorphism (Weisfeiler and Leman, 1968) are a family of effective and computa- 
tionally efficient tests that distinguish a broad class of graphs (Babai and Kucera, 
1979). Its 1-dimensional form (the 1-WL test), “naive vertex refinement”, is analogous
to the neighborhood aggregation in MP-GNN.

Comparing MP-GNN with the 1-WL test, the node-representation updating
procedure (Eqs. (5.1)-(5.3)) can be viewed as an implementation of Eq. (5.6),
and the READOUT operation in Eq. (5.4) can be viewed as a summary of all node
representations. Although MP-GNN was not proposed to perform graph isomorphism
testing, $f_{\text{MP-GNN}}$ can be used for this test: if
$f_{\text{MP-GNN}}(\mathcal{G}^{(1)}, S^{(1)}) \neq f_{\text{MP-GNN}}(\mathcal{G}^{(2)}, S^{(2)})$, then we know that
$(\mathcal{G}^{(1)}, S^{(1)}) \not\cong (\mathcal{G}^{(2)}, S^{(2)})$. Because of this analogy, the expressive power of
MP-GNN can be measured by the 1-WL test. Formally, we conclude the argument in the
following theorem.


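Theorem 5.2. (Lemma 2 in (Xu et al, 2019d), Theorem 1 in (Morris et al, 2019))
Consider two non-isomorphic GRL examples $(\mathcal{G}^{(1)}, S^{(1)})$ and $(\mathcal{G}^{(2)}, S^{(2)})$. If
$f_{\text{MP-GNN}}(\mathcal{G}^{(1)}, S^{(1)}) \neq f_{\text{MP-GNN}}(\mathcal{G}^{(2)}, S^{(2)})$, then the 1-WL test also decides that
$(\mathcal{G}^{(1)}, S^{(1)})$ and $(\mathcal{G}^{(2)}, S^{(2)})$ are not isomorphic.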


Theorem 5.2 indicates that MP-GNN is at most as powerful as the 1-WL test
in distinguishing different graph-structured features. Here, the 1-WL test is
considered an upper bound (instead of being equal to the expressive power of MP-GNN)
because the updating procedure which aggregates node colors from its neighbors
(Eq. (5.6)) is injective and can distinguish between the different aggregations
of node colors. This intuition is useful later to design MP-GNN that matches this
upper bound.


[Figure 5.8, panel text]
Step 1: Each node is initialized with some color according to its attribute
(if no attributes, use the same color). The mapping “attributes → colors” is injective.
Step 2: Each node collects the colors from its neighbors:
Node A: (p,{bby});
Left node E: (b,{py});
Right node E: (b,{pyg}); ...
The mapping “(self-color, set of colors from neighbors) → a new color” is injective.
After each iteration, check the set of node colors. Currently both graphs have the same set of colors.
We do step 2 again. After two iterations, we may distinguish these two graphs because left B will
get a color that will not appear in the right graph: currently left B has purple + blue in its
neighborhood while no nodes in the right graph have such a neighborhood.


Fig. 5.8: An illustration of steps that distinguish two graphs via the 1-dimensional
Weisfeiler-Lehman test. MP-GNN follows a similar procedure and may also distinguish
them.

Now that the upper bound of the representation power of MP-GNN has been 
established, a natural follow-up question is whether there are existing GNNs that 
are, in principle, as powerful as the 1-WL test. The answer is yes, as shown by
Theorem 5.3: if the message passing operation (the composition of Eqs. (5.1)-(5.3))
and the final READOUT (Eq. (5.4)) are both injective, then the resulting MP-GNN is as
powerful as the 1-WL test.


Theorem 5.3. (Theorem 3 in (Xu et al, 2019d)) After sufficient iterations, MP-GNN
may map any GRL examples $(\mathcal{G}^{(1)}, S^{(1)})$ and $(\mathcal{G}^{(2)}, S^{(2)})$, that the 1-WL test decides
as non-isomorphic, to different representations if the following two conditions hold:


a) The composition of MSG, AGG and UPT (Eqs. (5.1)-(5.3))
constructs an injective mapping from $(h_v^{(l-1)}, \{h_u^{(l-1)} \mid u \in \mathcal{N}_v\})$ to $h_v^{(l)}$.


b) The READOUT (Eq. (5.4)) is injective.


Although MP-GNN does not surpass the representation power of the 1-WL test, 
MP-GNN has important benefits over the 1-WL test from the perspective of ma- 
chine learning: node colors and the final decision given by the 1-WL test are dis- 
crete (represented as node colors or a “yes/no” decision) and thus cannot capture the 
similarity between graph structures. In contrast, an MP-GNN satisfying the criteria in
Theorem 5.3 generalizes the 1-WL test by learning to represent the graph structures
with vectors in a continuous space. This enables MP-GNN to not only discrimi- 
nate between different structures but also to learn to map similar graph structures 
to similar representations, thus capturing dependencies between graph structures. 
Such learned representations are particularly helpful for generalizations where data 
contains noisy edges and the exact matching graph structures are sparse (Yanardag 
and Vishwanathan, 2015). 

In the next subsection, we will focus on introducing the key design ideas behind 
MP-GNN that satisfies the conditions in Theorem 5.3. 


5.3.4 MP-GNN with the Power of the 1-WL Test 


Xu et al (2019d) introduced the key guidelines to satisfy the conditions in Theo- 
rem 5.3. First, to model injective multiset functions for the neighbor aggregation, 
the AGG operation (Eq. (5.2)) is suggested to adopt the sum pooling operation,
which is proved to universally represent functions defined over multisets
whose elements are from a countable space (Lemma 5.1). 


Lemma 5.1. (Lemma 4 in (Xu et al, 2019d)) Suppose $\mathcal{X}$ is a countable universe
of elements. Then there exists a function $\psi : \mathcal{X} \to \mathbb{R}^n$ such that $q(S) = \sum_{x \in S} \psi(x)$ is
unique for each finite multiset $S \subset \mathcal{X}$, where $\psi$ individually encodes each element in
$S$. Moreover, any multiset function $g$ can be decomposed as $g(S) = \phi(\sum_{x \in S} \psi(x))$
for some function $\phi$.


Remark 5.5. Note that the sum pooling operator is crucial, as some popular invari- 
ant pooling operators, such as the mean pooling operator, are not injective multiset 
functions. The significance of the sum pooling operation is to record the number 
of repetitive elements in a multiset. The mean pooling operation adopted by graph 
convolutional network (Kipf and Welling, 2017a) or the softmax-normalization (at- 
tention) pooling adopted by graph attention network (Veličković et al, 2018) may 
learn the distribution of the elements in a multiset but not the precise counts of the 
elements. 
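
The remark can be checked on a tiny example. In the sketch below, the element encoder `psi` is a one-hot encoding over a two-element universe (an assumption made only for illustration); the sum of one-hot vectors is the count vector of the multiset, whereas the mean and the maximum discard the counts.

```python
from fractions import Fraction

UNIVERSE = ["a", "b"]  # assumed finite universe of elements


def psi(x):
    """One-hot encoding of a single element."""
    return [1 if x == a else 0 for a in UNIVERSE]


def sum_pool(S):
    return tuple(sum(col) for col in zip(*(psi(x) for x in S)))


def mean_pool(S):
    return tuple(Fraction(c, len(S)) for c in sum_pool(S))


def max_pool(S):
    return tuple(max(col) for col in zip(*(psi(x) for x in S)))


S1 = ["a", "b"]            # multiset {a, b}
S2 = ["a", "a", "b", "b"]  # multiset {a, a, b, b}: same distribution, different counts

sums = (sum_pool(S1), sum_pool(S2))     # ((1, 1), (2, 2))
means = (mean_pool(S1), mean_pool(S2))  # both (1/2, 1/2)
maxes = (max_pool(S1), max_pool(S2))    # both (1, 1)
```

Only the sum separates the two multisets; the mean sees the same distribution and the maximum sees the same underlying set.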


Thanks to the universal approximation theorem (Hornik et al, 1989), we can use
multi-layer perceptrons (MLPs) to model and learn $\psi$ and $\phi$ in Lemma 5.1 for a
universally injective AGG operation. In MP-GNN, we do not even need to explicitly
model $\psi$ and $\phi$, as the MSG and UPT operations (Eqs. (5.1) and (5.3),
respectively) have already been implemented via MLPs. Therefore, using
the sum pooling as the AGG operation is sufficient to achieve the most expressive
MP-GNN:


Expressive Message: $m_{vu}^{(l)} \leftarrow \mathrm{MLP}_1^{(l)}(h_v^{(l-1)} \oplus h_u^{(l-1)})$, $\forall (u,v) \in \mathcal{E}$,
Expressive Aggregation: $a_v^{(l)} \leftarrow \sum_{u \in \mathcal{N}_v} m_{vu}^{(l)}$, $\forall v \in \mathcal{V}$,
Expressive Update: $h_v^{(l)} \leftarrow \mathrm{MLP}_2^{(l)}(h_v^{(l-1)} \oplus a_v^{(l)})$, $\forall v \in \mathcal{V}$,


where $\oplus$ denotes concatenation. Actually, we can even simplify the procedure by
using a single MLP. We can also set $m_{vu}^{(l)} \leftarrow h_u^{(l-1)}$, $\forall (u,v) \in \mathcal{E}$, without decreasing
the expressive power. Combining all the terms together, Xu et al (2019d) obtain
the simplest update mechanism of node representations that constructs an injective
mapping from $(h_v^{(l-1)}, \{h_u^{(l-1)} \mid u \in \mathcal{N}_v\})$ to $h_v^{(l)}$:


$h_v^{(l)} \leftarrow \mathrm{MLP}^{(l)}\big((1 + \epsilon^{(l)})\, h_v^{(l-1)} + \sum_{u \in \mathcal{N}_v} h_u^{(l-1)}\big)$, $\forall v \in \mathcal{V}$, (5.7)


where $\epsilon^{(l)}$ is a learnable weight. In the language of neural networks, this
updating method is termed the graph isomorphism network (GIN) layer (Xu et al, 2019d).
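
A minimal sketch of the update in Eq. (5.7) is given below in plain Python. The two-layer MLP with hand-set identity weights and the fixed value `eps = 0.0` are assumptions chosen so that the arithmetic can be followed by hand; in practice both the MLP weights and $\epsilon^{(l)}$ would be learned.

```python
from typing import Dict, List


def mlp(z: List[float], W1, b1, W2, b2) -> List[float]:
    """Two-layer perceptron with ReLU, written with plain lists."""
    hidden = [max(0.0, sum(w * zi for w, zi in zip(row, z)) + b)
              for row, b in zip(W1, b1)]
    return [sum(w * hj for w, hj in zip(row, hidden)) + b
            for row, b in zip(W2, b2)]


def gin_layer(h: Dict[int, List[float]], adj: Dict[int, List[int]],
              eps: float, params) -> Dict[int, List[float]]:
    """h_v <- MLP((1 + eps) * h_v + sum of neighbor representations)."""
    out = {}
    for v in adj:
        z = [(1.0 + eps) * hk for hk in h[v]]
        for u in adj[v]:
            z = [zk + hk for zk, hk in zip(z, h[u])]  # sum pooling keeps the counts
        out[v] = mlp(z, *params)
    return out


# Identity-like MLP on 1-dimensional features (illustrative weights).
params = ([[1.0]], [0.0], [[1.0]], [0.0])

star3 = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}  # center 0 with three leaves
star2 = {0: [1, 2], 1: [0], 2: [0]}             # center 0 with two leaves

out3 = gin_layer({v: [1.0] for v in star3}, star3, 0.0, params)
out2 = gin_layer({v: [1.0] for v in star2}, star2, 0.0, params)
```

With identical leaf features, the two centers receive `[4.0]` and `[3.0]`: the sum records how many neighbors there are, which a mean over the same neighbor features would not.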

Lemma 5.2 formally states that an MP-GNN that adopts Eq. (5.7) may satisfy
condition a) in Theorem 5.3.


Lemma 5.2. Updating node representations by following Eq. (5.7) constructs
an injective mapping from $(h_v^{(l-1)}, \{h_u^{(l-1)} \mid u \in \mathcal{N}_v\})$ to $h_v^{(l)}$, if the node
attributes $X$ are from a countable space.


Proof. Combine the proof for injectiveness of the sum aggregation with the univer- 
sal approximation property of MLP (Hornik et al, 1989). 


A similar idea may be adapted to the READOUT operation (Eq. (5.4)), which also
requires an injective mapping of multisets:


Expressive Inference: $\hat{y}_S = \mathrm{MLP}\big(\sum_{v \in S} h_v^{(L)}\big)$. (5.8)


Xu et al (2019d) has observed that node representations from earlier iterations may 
sometimes generalize better and thus also suggests using the READOUT (a counter- 
part to Eq.5.4) from the Jumping Knowledge Network (JK-Net) (Xu et al, 2018a), 
though it is not necessary from the perspective of the representation power of MP- 
GNN. 

Overall, combining the update Eq. (5.7) and the READOUT Eq. (5.8),
we may achieve an MP-GNN that is as powerful as the 1-WL test. In the next section,
we introduce several techniques that allow MP-GNN to break the limitation of
the 1-WL test and achieve even stronger expressive power.


5.4 Graph Neural Networks Architectures that are more Powerful than 1-WL Test


In the previous section, we characterized the representation power of MP-GNN, which
is bounded by the 1-WL test. In other words, if the 1-WL test cannot distinguish two
GRL examples $(\mathcal{G}^{(1)}, S^{(1)})$ and $(\mathcal{G}^{(2)}, S^{(2)})$, then MP-GNN cannot distinguish them
either. Although the 1-WL test fails to distinguish only a few corner-case graph structures,
this limitation indeed restricts the applicability of GNNs in many real-world applications
(You et al, 2019; Chen et al, 2020q; Ying et al, 2020b). In this section, we will introduce
several approaches to overcome the above limitation of MP-GNN.


5.4.1 Limitations of MP-GNN 


First, we will review several critical limitations of MP-GNN and the 1-WL test to 
gain the intuition for understanding the techniques that build more powerful GNNs.
MP-GNN iteratively updates the representation of each node by aggregating repre- 
sentations of its neighbors. The obtained node representation essentially encodes the 
subtree rooted at Node v (Fig. 5.7). However, using this rooted subtree to represent 
a node may lose useful information, such as: 


1. The information about the distance between multiple nodes is lost. For example, 
You et al (2019) noticed that MP-GNN has limited power in capturing the po- 
sition/location of a given node with respect to another node in the graph. Many 
nodes may share similar subtrees, and thus, MP-GNN produces the same rep- 
resentation for them although the nodes may be located at different locations in 
the graph. This location information of nodes is crucial for the tasks that depend 
on multiple nodes, such as link prediction (Liben-Nowell and Kleinberg, 2007), 
as two nodes that tend to be connected with a link are typically located close to 
each other. An illustrative example is shown in Fig. 5.9. 

2. The information about cycles is lost. Particularly, when expanding the subtree of 
a node, MP-GNN essentially loses track of the node identities in the subtrees.
An illustrative example is shown in Fig. 5.10. The information about cycles 
is crucial in applications such as subgraph matching (Ying et al, 2020b) and 
counting (Liu et al, 2020e) because loops frequently appear in the queried sub- 
graph patterns of a subgraph matching/counting problem. Chen et al (2020q) 
formally proved that MP-GNN is able to count star structures (a particular form 
of trees) but cannot count connected subgraphs with three or more nodes that 
form cycles. 


Theoretically, there is a general class of graph representation learning problems
that MP-GNN will fail to solve due to its limited representation. To show this, we
define a class of graphs, termed attributed regular graphs.


[Figure 5.9: a foodweb network; legible node labels include Seal, Orca, Pelagic Fish,
Lynx, White Spruce, Snowshoe Hare, Zooplankton, Krill, Red Fox and Baleen Whale.
Query: Which one is more likely the predator of Pelagic Fish, Lynx or Orca?]


Fig. 5.9: A foodweb network example that demonstrates limitations of MP-GNN
(Srinivasan and Ribeiro, 2020a). MP-GNN will associate Lynx and Orca with the
same node representations, i.e., $h_{\text{Lynx}}^{(L)} = h_{\text{Orca}}^{(L)}$, as these two nodes hold the same
rooted subtree. Note that we do not consider node features. Thus, MP-GNN cannot
predict whether Lynx or Orca is more likely to be the predator of Pelagic Fish (a
link prediction task).


[Figure 5.10: two graphs and the corresponding rooted subtrees of the highlighted nodes.]


Fig. 5.10: The node representations $h_u^{(L)}$ and $h_v^{(L)}$ given by MP-GNN are the same,
although they belong to different cycles (a 3-cycle and a 6-cycle, respectively).


Definition 5.8. (Attributed regular graphs) Consider an attributed graph $\mathcal{G} = (\mathcal{V}, \mathcal{E}, X)$.
All nodes in $\mathcal{G}$ are partitioned according to their attributes, $\mathcal{V} = \cup_{i=1}^{k} \mathcal{V}_i$, such that
two nodes from the same category $\mathcal{V}_i$ have the same attributes, while two nodes
from different categories have different attributes. If for any two categories $\mathcal{V}_i$, $\mathcal{V}_j$,
$i, j \in [k]$, and for any two nodes $u, v \in \mathcal{V}_i$, the number of neighbors of $u$ in $\mathcal{V}_j$ and the
number of neighbors of $v$ in $\mathcal{V}_j$ are equal, this graph is termed an attributed
regular graph. Denote $C_i$ as the attribute of nodes in $\mathcal{V}_i$. Also, denote the number
of neighbors in $\mathcal{V}_j$ of a node $v \in \mathcal{V}_i$ as $r_{ij}$. Then, the configuration of this attributed
regular graph can be represented as a set of tuples $\mathrm{Config}(\mathcal{G}) = \{(C_i, C_j, r_{ij})\}_{i,j \in [k]}$.
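
The definition can be turned into a small check. The sketch below is illustrative only: `config` is a hypothetical helper that returns $\mathrm{Config}(\mathcal{G})$ as a Python set when the graph is attributed regular and `None` otherwise, with graphs given as adjacency dictionaries and attributes as hashable values.

```python
from collections import Counter


def config(adj, attr):
    """Return {(C_i, C_j, r_ij)} if the graph is attributed regular, else None."""
    categories = sorted(set(attr.values()))
    r = {}
    for v in adj:
        counts = Counter(attr[u] for u in adj[v])  # neighbors of v per category
        for c in categories:
            key = (attr[v], c)
            # All nodes of one category must agree on the neighbor count r_ij.
            if r.setdefault(key, counts[c]) != counts[c]:
                return None
    return {(ci, cj, rij) for (ci, cj), rij in r.items()}


two_triangles = {0: [1, 2], 1: [0, 2], 2: [0, 1], 3: [4, 5], 4: [3, 5], 5: [3, 4]}
hexagon = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
path3 = {0: [1], 1: [0, 2], 2: [1]}

same_attr6 = {v: 0 for v in range(6)}
config_triangles = config(two_triangles, same_attr6)
config_hexagon = config(hexagon, same_attr6)
config_path = config(path3, {0: 0, 1: 0, 2: 0})
```

Two disjoint triangles and a 6-cycle with identical attributes share the configuration `{(0, 0, 2)}` although they are not isomorphic, while the 3-node path is not attributed regular because its end nodes and middle node have different degrees.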


Note that the definition of attributed regular graphs is similar to k-partite regular
graphs, while attributed regular graphs allow edges connecting nodes from the same
partition. It can be shown that the 1-WL test will color all the nodes of one partition
in the same way. Based on the bound of representation power of MP-GNN (Theorem 5.2),
we can obtain the following corollary about the impossibility of MP-GNN
to distinguish GRL examples defined on attributed regular graphs. Fig. 5.11 gives
some examples that illustrate the impossibility. Actually, with sufficient layers
(iterations), MP-GNN (the 1-WL test) will always transform any attributed graph into
an attributed regular graph (Arvind et al, 2019) if we view the node representations
obtained by MP-GNN as the node attributes on this transformed graph¹.


Fig. 5.11: A pair of attributed regular graphs $\mathcal{G}^{(1)}$, $\mathcal{G}^{(2)}$ with the same configuration
and a proper selection of $S^{(1)}$, $S^{(2)}$: MP-GNN and the 1-WL test fail to distinguish
$(\mathcal{G}^{(1)}, S^{(1)})$, $(\mathcal{G}^{(2)}, S^{(2)})$.


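Corollary 5.1. Consider two graph-structured features $(\mathcal{G}^{(1)}, S^{(1)})$, $(\mathcal{G}^{(2)}, S^{(2)})$. If
two attributed regular graphs $\mathcal{G}^{(1)}$, $\mathcal{G}^{(2)}$ share the same configuration, i.e.,
$\mathrm{Config}(\mathcal{G}^{(1)}) = \mathrm{Config}(\mathcal{G}^{(2)})$, and two multisets of attributes
$\{X_v^{(1)} \mid v \in S^{(1)}\}$ and $\{X_v^{(2)} \mid v \in S^{(2)}\}$ are also equal, then
$f_{\text{MP-GNN}}(\mathcal{G}^{(1)}, S^{(1)}) = f_{\text{MP-GNN}}(\mathcal{G}^{(2)}, S^{(2)})$. Therefore, if a graph
representation learning problem associates these two GRL examples with
different targets, MP-GNN does not hold the expressive power to distinguish them
and predict their correct targets.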


Proof. The proof is obtained by tracking each iteration of the 1-WL test and per- 
forming an induction. 


Next, we will introduce several approaches that address the above limitations and
that further improve the expressive power of MP-GNN.


5.4.2 Injecting Random Attributes 


The main reason for limitations on the expressive power of MP-GNN is that MP-GNN
does not track node identities: different nodes with the same attributes
will be initialized with the same vector representations. This condition will
be maintained unless their neighbors propagate different node representations. One
way to improve the expressive power of MP-GNN is to inject each node with a
unique attribute. Given a GRL example $(\mathcal{G}, S)$, where $\mathcal{G} = (A, X)$,


$g_I(\mathcal{G}, S) = (\mathcal{G}_I, S)$, where $\mathcal{G}_I = (A, X \oplus I)$, (5.9)


where $\oplus$ is concatenation and $I$ is an identity matrix. This gives each node a unique
one-hot encoding and yields a new attributed graph $\mathcal{G}_I$. The composite model
$f_{\text{MP-GNN}} \circ g_I$ increases expressive power as node identities are attached to the
messages in the message passing framework and the distance and loop information can
be learnt with sufficient iterations of message propagation.


¹ Most transformed graphs have one single node per partition. In this case, two graphs that share
the same configuration are isomorphic.

However, the limitation of the above framework is that it is not permutation
invariant (Def. 5.4): given two isomorphic GRL examples $(\mathcal{G}^{(1)}, S^{(1)}) \cong (\mathcal{G}^{(2)}, S^{(2)})$,
$g_I(\mathcal{G}^{(1)}, S^{(1)})$ and $g_I(\mathcal{G}^{(2)}, S^{(2)})$ may not be isomorphic any more. Then, the
composite model $f_{\text{MP-GNN}} \circ g_I(\mathcal{G}^{(1)}, S^{(1)})$ may not equal $f_{\text{MP-GNN}} \circ g_I(\mathcal{G}^{(2)}, S^{(2)})$. As
the obtained model loses the fundamental inductive bias of graph representation
learning, it is hard to generalize².
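
The loss of invariance can be seen on a 3-node path. The sketch below uses a deliberately simple one-layer model, namely the node's own features plus the sum of its neighbors' features with no learned weights (an assumption for illustration), and compares its output on an example and on an isomorphic relabeling of it, with and without the one-hot identities of Eq. (5.9). All helper names are hypothetical.

```python
def inject_identity(x):
    """g_I: concatenate row i of the identity matrix to the attributes of node i."""
    n = len(x)
    return [row + [1 if j == i else 0 for j in range(n)] for i, row in enumerate(x)]


def sum_layer(x, edges, v):
    """One sum-aggregation layer evaluated at node v (identity MSG and UPT)."""
    h = list(x[v])
    for a, b in edges:
        if a == v:
            h = [p + q for p, q in zip(h, x[b])]
        elif b == v:
            h = [p + q for p, q in zip(h, x[a])]
    return h


def relabel(x, edges, v, perm):
    """Rename node i to perm[i]; the result is isomorphic to the input example."""
    x_new = [None] * len(x)
    for old, new in enumerate(perm):
        x_new[new] = x[old]
    return x_new, [(perm[a], perm[b]) for a, b in edges], perm[v]


x = [[1], [1], [1]]       # identical original attributes
edges = [(0, 1), (1, 2)]  # path 0 - 1 - 2
v = 0                     # node of interest: one end of the path
x2, edges2, v2 = relabel(x, edges, v, [2, 1, 0])

plain_1 = sum_layer(x, edges, v)
plain_2 = sum_layer(x2, edges2, v2)
with_id_1 = sum_layer(inject_identity(x), edges, v)
with_id_2 = sum_layer(inject_identity(x2), edges2, v2)
```

Without identities both examples give `[2]`. With identities the same end node is represented as `[2, 1, 1, 0]` in one labeling and `[2, 0, 1, 1]` in the other, so the model's output now depends on the arbitrary node order.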


Remark 5.6. Some other approaches may share the same limitation with $g_I$, e.g.,
using the adjacency matrix $A$ (each row of $A$ representing node attributes). However,
Srinivasan and Ribeiro (2020a) argued that node embeddings obtained via matrix
factorization, such as DeepWalk (Perozzi et al, 2014) and node2vec (Grover and
Leskovec, 2016), can keep the required invariance and thus are still generalizable.
We will return to this concept in Sec. 5.4.2.4.


To overcome the above limitation, different methods have been proposed recently.
These models share the following strategy: they first design some additional
random node attributes $Z$, use them to augment the original dataset, and then learn a
GNN model over the augmented dataset (Fig. 5.12).

The obtained models will be more expressive, as the random node attributes can 
be viewed as unique node identities that distinguish nodes. However, if the model 
is only trained based on a single GRL example augmented by these random at- 
tributes, it cannot keep invariance as discussed above. Instead, the model needs 
to be trained over multiple GRL examples augmented by independently injected 
random attributes. The new augmented GRL examples have the same target as the 
original GRL examples from which they are generated. This training of models over 
augmented examples essentially regularizes the permutation variance of the models 
and makes them behave almost “permutation invariant.” 

Different methods to inject these random attributes may be adopted, but a direct
way is to attach $Z$ to $X$, i.e., given a graph-structured example $(\mathcal{G}, S)$, where $\mathcal{G} = (A, X)$,


$g_Z(\mathcal{G}, S) = (\mathcal{G}_Z, S)$, where $\mathcal{G}_Z = (A, X_Z)$ and $X_Z = X \oplus Z$. (5.10)


Note that for each realization $Z$, the composite model $f_{\text{MP-GNN}} \circ g_Z$ is not
permutation invariant. Instead, all these approaches make $\mathbb{E}[f_{\text{MP-GNN}} \circ g_Z]$ permutation
invariant and expect the models to keep invariant in expectation. To match such
invariance in expectation, an approach must satisfy the following proposition.


Proposition 5.1. The following two properties are needed to build a model by
injecting random features $Z$.


1. A sufficient number of $Z$'s should be sampled during the training stage so that
the model indeed captures permutation invariance in expectation.
2. The randomness in $Z$ should be agnostic to the original node identities.


² Recent literature often states that the composite model is not inductive. Inductiveness and
generalization to unobserved examples are related. In the transductive setting, $f_{\text{MP-GNN}} \circ g_I$ is less
generalizable than $f_{\text{MP-GNN}}$, although the prediction performance of $f_{\text{MP-GNN}} \circ g_I$ may be
sometimes better than $f_{\text{MP-GNN}}$ due to the much stronger expressive power of $f_{\text{MP-GNN}} \circ g_I$.


[Figure 5.12: random attributes are attached to the original attributes.]

Types of random attributes           Positional information   Model & reference
Random permutations                  No                       RP-GNN (Murphy et al, 2019)
(Almost uniform) discrete r.v.       No                       rGIN (Sato et al, 2020)
Distances to random anchor sets      Yes                      PGNN (You et al, 2019)
Graph-convoluted Gaussian r.v.       Yes                      CGNN (Srinivasan & Ribeiro, 2020)
Random signed Laplacian eigenmap     Yes                      LE-GNN (Dwivedi et al, 2020)


Fig. 5.12: Injecting random node attributes can improve the expressive power of
GNNs. Different types of random node attributes are adopted in different works.
Some random node attributes contain node positional information (the position of a
node with respect to other nodes in the graph).


To satisfy property 1, it is suggested that for each forward pass computing
$f_{\text{MP-GNN}} \circ g_Z$ during the training stage, $Z$ be re-sampled once or
multiple times to obtain enough data augmentation. To satisfy property 2, four
different types of random $Z$ have been proposed, as described next.


5.4.2.1 Relational Pooling - GNN (RP-GNN) (Murphy et al, 2019a) 


Murphy et al (2019a) considered randomly assigning an order of nodes as their extra
attributes and proposed the model relational pooling GNN (RP-GNN). We use $Z_{RP}$
to denote the additional node attributes $Z$ used in RP-GNN. Suppose the graph $\mathcal{G}$ has
$n$ nodes; $Z_{RP}$ is uniformly sampled from all possible permutation matrices. That
is, randomly pick a bijective mapping (permutation) $\pi : \mathcal{V}[\mathcal{G}] \to \mathcal{V}[\mathcal{G}]$, and design the
permutation matrix $[Z_{RP}]_{ij} = 1$ if $j = \pi(i)$ and $[Z_{RP}]_{ij} = 0$ otherwise. Then, RP-GNN
adopts the composite model,


$f_{\text{RP-GNN}} = \mathbb{E}[f_{\text{MP-GNN}} \circ g_{Z_{RP}}]$. (5.11)


Theorem 5.4. (Theorem 2.2 (Murphy et al, 2019a)) The RP-GNN $f_{\text{RP-GNN}}$ is
strictly more powerful than the original $f_{\text{MP-GNN}}$.


Computing the expectation $\mathbb{E}[f_{\text{MP-GNN}} \circ g_{Z_{RP}}]$ is intractable as one needs to
compute $f_{\text{MP-GNN}} \circ g_{Z_{RP}}$ for all possible permutations $\pi : \mathcal{V}[\mathcal{G}] \to \mathcal{V}[\mathcal{G}]$. To overcome
this problem, sampling of $Z_{RP}$ may be needed.

However, as the entire permutation space is too large, uniformly random sampling
of a limited number of $Z_{RP}$ may introduce a large variance. To reduce the
potential variance, Murphy et al (2019a) also proposed to sample $\pi$'s that
permute only a small subset of nodes instead of the entire set of nodes. More recently,
Chen et al (2020q) further adapted RP-GNN to solve the subgraph counting problem.
They suggest using all $\pi$'s that permute all the nodes of each connected local
subgraph.
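
The difference between a single realization of $Z_{RP}$ and the expectation in Eq. (5.11) can be illustrated on a graph small enough to enumerate all $n!$ permutations. The sketch below is a toy under stated assumptions: `f_with_ids` is a hypothetical one-layer, permutation-sensitive model (a fixed linear read of the attribute and the one-hot identity, sum aggregation, and a square as the nonlinearity), not the architecture of Murphy et al (2019a).

```python
import math
import random
from fractions import Fraction
from itertools import permutations


def f_with_ids(adj, x, v, pi):
    """Toy model on X (+) Z_RP; pi[u] is the column holding the 1 in row u of Z_RP."""
    def encode(u):
        # Fixed linear read of (x_u, one-hot(pi[u])) with weights (1; 0, 10, 20, ...).
        return x[u] + 10 * pi[u]
    s = 2 * encode(v) + sum(encode(u) for u in adj[v])  # sum aggregation + update
    return s * s                                        # nonlinearity


def f_rp_exact(adj, x, v):
    """Exact expectation over all n! permutations (feasible only for tiny n)."""
    n = len(adj)
    total = sum(f_with_ids(adj, x, v, pi) for pi in permutations(range(n)))
    return Fraction(total, math.factorial(n))


def f_rp_sampled(adj, x, v, num_samples, rng):
    """Monte Carlo estimate using uniformly sampled permutations."""
    n = len(adj)
    draws = [f_with_ids(adj, x, v, rng.sample(range(n), n)) for _ in range(num_samples)]
    return Fraction(sum(draws), num_samples)


# A 3-node path evaluated at one end, and an isomorphic relabeling (0 <-> 2).
adj_a, x_a, v_a = {0: [1], 1: [0, 2], 2: [1]}, {0: 1, 1: 1, 2: 1}, 0
adj_b, x_b, v_b = {2: [1], 1: [2, 0], 0: [1]}, {0: 1, 1: 1, 2: 1}, 2

identity = (0, 1, 2)
single_a = f_with_ids(adj_a, x_a, v_a, identity)  # (2*1 + 11)**2 = 169
single_b = f_with_ids(adj_b, x_b, v_b, identity)  # (2*21 + 11)**2 = 2809
exact_a = f_rp_exact(adj_a, x_a, v_a)
exact_b = f_rp_exact(adj_b, x_b, v_b)
estimate_a = f_rp_sampled(adj_a, x_a, v_a, 50, random.Random(0))
```

A single realization gives different outputs for the two isomorphic examples, while the exact expectations agree. The sampled estimate only approximates the expectation, which is the source of the variance discussed above.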


5.4.2.2 Random Graph Isomorphism Network (rGIN) (Sato et al, 2021)


Sato et al (2021) generalized RP-GNN by setting the additional attributes of each
node sampled from an almost uniform discrete probability distribution. The key
difference from RP-GNN is that the additional attributes of two nodes are set to
be independent of each other (while in RP-GNN, one-time random attributes of
different nodes are correlated due to the nature of permutation). We use $Z_r$ to denote
$Z$ used in rGIN and $[Z_r]_v$ to denote the attributes of node $v$. For example, they set


$f_{\text{rGIN}} = \mathbb{E}[f_{\text{MP-GNN}} \circ g_{Z_r}]$, where $[Z_r]_v \sim \mathrm{Unif}(\mathcal{D})$ i.i.d. $\forall v \in \mathcal{V}[\mathcal{G}]$,


where $\mathbb{E}$ indicates expectation and $\mathcal{D}$ is a discrete space with at least $1/p$ elements
for some $p > 0$. Similar to RP-GNN, $f_{\text{rGIN}}$ can be implemented by sampling only
a few $Z_r$'s for each evaluation of $f_{\text{MP-GNN}} \circ g_{Z_r}$ (indeed, one $Z_r$ is sampled per
forward evaluation (Sato et al, 2021)).
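
To see why i.i.d. random attributes help, consider deciding whether a node lies on a triangle, a question the 1-WL test cannot answer on regular graphs. The sketch below is an illustration under stated assumptions and is not the rGIN architecture itself: `sample_attributes` draws $[Z_r]_v$ uniformly from a hypothetical domain $\{0, \ldots, \text{domain\_size} - 1\}$, and `in_triangle` is a hand-written two-round message-passing rule that is only reliable when the sampled attributes happen to be distinct.

```python
import random


def sample_attributes(nodes, domain_size, rng):
    """[Z_r]_v ~ Unif({0, ..., domain_size - 1}), i.i.d. across nodes."""
    return {v: rng.randrange(domain_size) for v in nodes}


def in_triangle(adj, z, v):
    """Two rounds of message passing on the attributes z, evaluated at node v."""
    # Round 1: every node u stores (own attribute, set of neighbor attributes).
    h1 = {u: (z[u], frozenset(z[w] for w in adj[u])) for u in adj}
    # Round 2: v compares its own neighbor attributes with those seen by each neighbor.
    own_neighbor_ids = h1[v][1]
    for u in adj[v]:
        z_u, neighbor_ids_of_u = h1[u]
        # A shared attribute other than z_v and z_u signals a common neighbor of v and u.
        if (own_neighbor_ids & neighbor_ids_of_u) - {z[v], z_u}:
            return True
    return False


two_triangles = {0: [1, 2], 1: [0, 2], 2: [0, 1], 3: [4, 5], 4: [3, 5], 5: [3, 4]}
hexagon = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}

constant = {v: 0 for v in range(6)}      # no random attributes
distinct = {v: 100 + v for v in range(6)}  # stands in for a collision-free draw
sampled = sample_attributes(range(6), 10**9, random.Random(0))
```

With constant attributes the rule returns `False` on both graphs, mirroring the failure of MP-GNN. With distinct attributes it returns `True` for a node of the two triangles and `False` for a node of the 6-cycle. A fresh `sampled` dictionary would be drawn for every forward pass, and collisions become unlikely as the domain grows.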


Theorem 5.5. (Theorem 4.1 (Sato et al, 2021)) Consider a GRL example $(\mathcal{G}, v)$,
where only a single node is contained in the node set of interest. For any
graph-structured features $(\mathcal{G}', v')$, where the nodes of $\mathcal{G}'$ have a bounded maximal degree
and the attributes $X$ come from a finite space, there exists an MP-GNN such
that:


1. If $(\mathcal{G}', v') \cong (\mathcal{G}, v)$, $f_{\text{MP-GNN}} \circ g_{Z_r}(\mathcal{G}', v') > 0.5$ with high probability.
2. If $(\mathcal{G}', v') \not\cong (\mathcal{G}, v)$, $f_{\text{MP-GNN}} \circ g_{Z_r}(\mathcal{G}', v') < 0.5$ with high probability.


This result can be viewed as a characterization of the expressive power of rGIN. 
However, this result is weakened by the fact that almost all nodes of all graphs will
be associated with different representations within two iterations of the 1-WL test
(and likewise of MP-GNN) (Babai and Kucera, 1979). Moreover, the isomorphism problem of
graphs with a bounded degree is known to be in P (Fortin, 1996). Instead, a very 
recent work was able to demonstrate the universal approximation of rGIN, which 
gives a stronger characterization of the expressive power of rGIN. 


Theorem 5.6. (Theorem 4.1 (Abboud et al, 2020)) Consider any invariant mapping
$f^* : \mathcal{G}_n \to \mathbb{R}$, where $\mathcal{G}_n$ contains all graphs with $n$ nodes. Then, there exists an rGIN
$f_{\text{MP-GNN}} \circ g_{Z_r}$ such that


$\mathbb{P}(|f_{\text{MP-GNN}} \circ g_{Z_r} - f^*| < \epsilon) > 1 - \delta$, for some given $\epsilon > 0$, $\delta \in (0, 1)$.


The above RP-GNN and rGIN adopt random attributes that are totally agnostic 
to the input data $(\mathcal{G}, S)$. Instead, the next two methods inject random attributes that
leverage the input data. Particularly, these random attributes are related to the po- 
sition/location of a node in the graph, which tends to counter the loss of positional 
information of nodes in MP-GNN. 




5.4.2.3 Position-aware GNN (PGNN) (You et al, 2019) 


You et al (2019) demonstrated that MP-GNN may not capture the position/loca- 
tion of a node in the graph, which is critical information for applications such as 
link prediction. Therefore, they proposed to use node positional embeddings as ex- 
tra attributes. To capture permutation invariance in the sense of expectation, node 
positional embeddings are generated based on randomly selected anchor node sets. 
We denote the random attributes adopted in PGNN as $Z_P$, which is constructed as
follows. Considering a graph $\mathcal{G} = (\mathcal{V}, \mathcal{E}, X)$,


1. Randomly select a few anchor sets $(S_1, S_2, \ldots, S_K)$, where $S_k \subset \mathcal{V}$. Note that the
choice of $S_k$ is agnostic to the node identities: given a $k$, $S_k$ will include each
node with the same probability.

2. For each node $u$ in $\mathcal{G}$, set $[Z_P]_u = (d(u, S_1), \ldots, d(u, S_K))$, where $d(u, S_k)$, $k \in [K]$, is a
distance metric between $u$ and the anchor set $S_k$.


As the selection of the anchor sets is agnostic to node identities, the obtained Zp still 
satisfies the property 2 in Proposition 5.1. Next, we specify the strategy to sample 
these anchor sets and the choice of the distance metric. The primary requirement to 
select those anchor sets is to keep low distortion of the two distances between nodes, 
where one distance is given by the original graph and the other one is given by those 
anchor sets. Specifically, distortion measures the faithfulness of the embeddings in 
preserving distances when mapping from one metric space to another metric space, 
which is defined as follows: 


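Definition 5.9. Given two metric spaces $(\mathcal{V}, d)$ and $(\mathcal{Z}, d')$ and a function $Z_P : \mathcal{V} \to \mathcal{Z}$,
$Z_P$ is said to have distortion $\alpha$ if $\forall u, v \in \mathcal{V}$, $\frac{1}{\alpha} d(u, v) \le d'([Z_P]_u, [Z_P]_v) \le d(u, v)$.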


Fortunately, Bourgain (1985) showed the existence of a low distortion embedding
that maps from any metric space to the $l_p$ metric space:


Theorem 5.7. (Bourgain’s Theorem (Bourgain, 1985)) Given any finite metric
space $(\mathcal{V}, d)$ with $|\mathcal{V}| = n$, there exists an embedding of $(\mathcal{V}, d)$ into $\mathbb{R}^K$ under any
$l_p$ metric, where $K = O(\log^2 n)$, and the distortion of the embedding is $O(\log n)$.


Based on a constructive proof of Theorem 5.7, Linial et al (1995) provide an
algorithm to construct an $O(\log^2 n)$-dimensional embedding via anchor sets. This
yields the selection of anchor sets and the definition of the distance metric to define
$Z_P$, which are adopted by PGNN (You et al, 2019).

Select $K = c\log^2 n$ random sets $S_{i,j} \subset \mathcal{V}$, $i = 1, 2, \ldots, \log n$, $j = 1, 2, \ldots, c\log n$,
where $c$ is a constant and $S_{i,j}$ is chosen by including each point in $\mathcal{V}$
independently with probability $1/2^i$. We further define


$[Z_P]_v = \left( \frac{d(v, S_{1,1})}{K}, \frac{d(v, S_{1,2})}{K}, \ldots, \frac{d(v, S_{\log n, c\log n})}{K} \right)$, (5.12)


where $d(v, S_{i,j}) = \min_{u \in S_{i,j}} d(v, u)$. Then, $Z_P$ is an embedding method that satisfies
Theorem 5.7.

Compared with RP-GNN and rGIN, the random attributes adopted by PGNN
deal specifically with the positional information of a node in the graph. Therefore,
PGNN is better for the tasks that are directly related to the positions of nodes,
e.g., link prediction. You et al (2019) did not provide a mathematical characterization
of the representation power of PGNN. However, the way $Z_P$ is established
implies that for two nodes $u$, $v$, the attributes $[Z_P]_u$ and $[Z_P]_v$ are by definition
statistically correlated. As for the example in Fig. 5.9, such correlation gives PGNN
the information that the distance between Lynx and Pelagic Fish is different from
the distance between Orca and Pelagic Fish, and thus PGNN may successfully distinguish
$(\mathcal{G}, \{\text{Lynx}, \text{Pelagic Fish}\})$ and $(\mathcal{G}, \{\text{Orca}, \text{Pelagic Fish}\})$ and make the right link
prediction.
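
The anchor-set construction behind Eq. (5.12) can be sketched as follows. This is an illustrative reading under stated assumptions: $d$ is taken to be the shortest-path distance, logarithms are base 2 and rounded up, an empty anchor set yields an infinite distance, and the helper names are hypothetical. It covers only the attribute construction, not the message passing of PGNN.

```python
import math
import random
from collections import deque


def bfs_distances(adj, source):
    """Shortest-path distances from source in an unweighted graph."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def sample_anchor_sets(nodes, c, rng):
    """Sample S_{i,j}: each node joins independently with probability 1 / 2**i."""
    log_n = max(1, math.ceil(math.log2(len(nodes))))
    sets = []
    for i in range(1, log_n + 1):
        for _ in range(c * log_n):
            sets.append({v for v in nodes if rng.random() < 0.5 ** i})
    return sets


def anchor_embedding(adj, anchor_sets):
    """[Z_P]_v = (d(v, S_1) / K, ..., d(v, S_K) / K) with d(v, S) = min over u in S."""
    K = len(anchor_sets)
    inf = float("inf")
    dist = {v: bfs_distances(adj, v) for v in adj}
    return {v: [min((dist[v].get(u, inf) for u in S), default=inf) / K
                for S in anchor_sets]
            for v in adj}


# Deterministic illustration on a 4-node path with hand-picked anchor sets.
path4 = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
emb = anchor_embedding(path4, [{0}, {3}, {1, 2}])

# Random anchor sets for an 8-node graph: K = c * ceil(log2 n)**2 = 2 * 3 * 3.
random_sets = sample_anchor_sets(list(range(8)), 2, random.Random(0))
```

The two end nodes of the path have identical rooted subtrees, so MP-GNN alone cannot separate them, yet their anchor-based attributes differ (`emb[0]` versus `emb[3]`), which is the positional signal used for tasks such as link prediction.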

Note that the original PGNN (You et al, 2019) does not use MP-GNN as the 
backbone to perform message passing. Instead, PGNN allows message passing from 
nodes to anchor sets. As such, this approach is not directly relevant to the expressive 
power and is thus out of the scope of this chapter, so we will not discuss it in detail. 
Interested readers may refer to the original paper (You et al, 2019). 


5.4.2.4 Randomized Matrix Factorization (Srinivasan and Ribeiro, 2020a; Dwivedi et al, 2020)


Srinivasan and Ribeiro (2020a) recently made an important observation that node 
positional embeddings obtained via the factorization of some variants of the adja- 
cency matrix A can be used as node attributes as long as certain random perturbation 
is allowed. The obtained models still keep permutation invariance in expectation. 
Srinivasan and Ribeiro (2020a) argue that a model that is built upon these random 
perturbed node positional embeddings is still inductive and holds good general- 
ization properties. This significant observation challenges the traditional claim that 
models built upon these node positional embeddings are not inductive. A high-level 
idea of why this is true is as follows: suppose the SVD decomposition of the adjacency
matrix is $A = U \Sigma U^T$. When we permute the order of nodes, that is, the row and
column orders of $A$, the row order of $U$ will be changed simultaneously. Therefore,
the models that use $U$ as the node attributes should keep the permutation invariance.
The random perturbation of the factorization is needed because such an SVD decomposition
is not unique.
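
The two facts used in this argument can be written out explicitly. The short derivation below is a sketch under the assumptions that $A$ is symmetric with $A = U \Sigma U^{\top}$ and $\Sigma$ diagonal, that $P$ is a permutation matrix, and that $\Gamma$ is a diagonal matrix with entries $\pm 1$; the symbols $P$ and $\Gamma$ are introduced here for illustration.

```latex
% Assumptions: A = U \Sigma U^{\top} with \Sigma diagonal, P a permutation
% matrix, and \Gamma a diagonal matrix with entries \pm 1 (so \Gamma^2 = I).
\begin{align*}
P A P^{\top} &= P U \Sigma U^{\top} P^{\top} = (P U)\, \Sigma\, (P U)^{\top}, \\
(U \Gamma)\, \Sigma\, (U \Gamma)^{\top} &= U\, (\Gamma \Sigma \Gamma)\, U^{\top}
  = U \Sigma U^{\top} = A .
\end{align*}
```

The first line shows that permuting the nodes permutes the rows of $U$ in the same way, which is the equivariance a GNN input needs. The second line shows that flipping the sign of any column of $U$ gives an equally valid factorization (diagonal matrices commute and $\Gamma^2 = I$), so a model must not depend on one particular choice of signs.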

Although Srinivasan and Ribeiro (2020a) proposed this idea, they did not
explicitly compute the node positional embeddings via matrix factorization.
Instead, their method samples a series of Gaussian random matrices
$Z_{G,1}, Z_{G,2}, \dots$ and lets them propagate over the graph, e.g., for two hops,


$$Z_G = \psi(\hat{A}\,\psi(\hat{A} Z_{G,1}) + Z_{G,2}),$$


where the $\psi$'s are MLPs and $\hat{A}$ indicates some variant of the adjacency
matrix. The rows of $Z_G$ essentially give rough node positional embeddings. Then,
these obtained node embeddings are further used as the attributes of nodes in
MP-GNN.

Dwivedi et al (2020) indeed adopted matrix factorization explicitly. They
proposed to use the randomly perturbed Laplacian eigenmaps as the additional
attributes. Specifically, suppose the normalized Laplacian matrix is defined as


$$L = I - D^{-1/2} A D^{-1/2},$$


where $D$ is the diagonal degree matrix. Denote the eigenvalue decomposition of $L$
as $L = U \Sigma U^T$. The eigenvalue decomposition is not unique, so we assume
that $U$ can be arbitrarily chosen from all the potential choices. Fortunately, if
there are no multiple eigenvalues, each column of $U$ is unique up to a $\pm$ sign.
Then, we may directly set the extra node attributes as


$$Z_{LE} = U\Gamma, \quad \text{where } \Gamma_{ii} \sim \text{Unif}(\{-1,1\}) \text{ i.i.d. } \forall i \in [|V|], \quad \Gamma_{ij} = 0, \ \forall i \neq j, \tag{5.13}$$


where $\Gamma$ is a diagonal matrix whose diagonal elements are independently and
uniformly set to 1 or $-1$. Here, $U$ can be replaced with a subset of its columns.
Let $g_{Z_{LE}}$ denote the operation that concatenates these additional attributes
$Z_{LE}$ with the original node attributes. Then, the overall composite model
becomes $f_{\text{MP-GNN}} \circ g_{Z_{LE}}$. The following lemma shows the
permutation invariance of $f_{\text{MP-GNN}} \circ g_{Z_{LE}}$ in expectation if the
Laplacian matrices have distinct eigenvalues:


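Lemma 5.3. If $(\mathcal{G}^{(1)}, S^{(1)}) \cong (\mathcal{G}^{(2)}, S^{(2)})$ and if
there are no multiple eigenvalues of their corresponding normalized Laplacian
matrices, then any choice of eigenvalue decomposition to obtain node embeddings
will yield


$$\mathbb{E}\big(f_{\text{MP-GNN}} \circ g_{Z_{LE}}(\mathcal{G}^{(1)}, S^{(1)})\big) = \mathbb{E}\big(f_{\text{MP-GNN}} \circ g_{Z_{LE}}(\mathcal{G}^{(2)}, S^{(2)})\big).$$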


Proof. The proof can be easily seen from the above arguments. 
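
The construction of Eq. 5.13 can be illustrated with a minimal NumPy sketch. The
function name `laplacian_eigenmap_attributes`, the `num_vecs` truncation
argument and the 4-node path graph are assumptions made for this illustration
and are not taken from Dwivedi et al (2020). The sketch builds the normalized
Laplacian, takes its eigenvectors and flips each column by an independent random
sign. On a graph with distinct eigenvalues, relabeling the nodes only permutes
the rows of the result, up to the random column signs.

```python
import numpy as np

def laplacian_eigenmap_attributes(A, num_vecs, rng):
    """Return Z_LE = U * Gamma for the normalized Laplacian of adjacency A.

    A: (n, n) symmetric 0/1 adjacency matrix without isolated nodes.
    num_vecs: number of eigenvector columns to keep (assumed truncation).
    rng: numpy Generator used to draw the random +/-1 signs.
    """
    deg = A.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    # L = I - D^{-1/2} A D^{-1/2}
    L = np.eye(A.shape[0]) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
    _, U = np.linalg.eigh(L)          # columns are orthonormal eigenvectors
    U = U[:, :num_vecs]
    signs = rng.choice([-1.0, 1.0], size=num_vecs)  # diagonal of Gamma
    return U * signs[None, :]

# Path graph on 4 nodes: its normalized Laplacian has distinct eigenvalues.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
rng = np.random.default_rng(0)
Z = laplacian_eigenmap_attributes(A, num_vecs=4, rng=rng)

# Relabel the nodes: new node i is old node perm[i].
perm = np.array([2, 0, 3, 1])
A_perm = A[np.ix_(perm, perm)]
Z_perm = laplacian_eigenmap_attributes(A_perm, num_vecs=4, rng=rng)
```

Comparing `np.abs(Z_perm)` with `np.abs(Z[perm])` removes the sign ambiguity and
shows that the attributes follow the node relabeling, which is the reason the
composite model is invariant only in expectation over the signs.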


As shown in Lemma 5.3, the composite model keeps permutation invariance
in expectation for most graphs, although it may break invariance in some corner
cases. Regarding the expressive power, $Z_{LE}$ associates different nodes with
distinct attributes because $U$ is an orthogonal matrix by definition. Hence, there
must exist an $f_{\text{MP-GNN}} \circ g_{Z_{LE}}$ that may distinguish any node
subsets from the graph:


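Theorem 5.8. For any two GRL examples $(\mathcal{G}, S^{(1)})$, $(\mathcal{G}, S^{(2)})$
over the same graph $\mathcal{G}$, even if they are isomorphic, as long as
$S^{(1)} \neq S^{(2)}$, there exists an $f_{\text{MP-GNN}}$ such that
$f_{\text{MP-GNN}} \circ g_{Z_{LE}}(\mathcal{G}, S^{(1)}) \neq f_{\text{MP-GNN}} \circ g_{Z_{LE}}(\mathcal{G}, S^{(2)})$.
However, if those two GRL examples are indeed isomorphic,
$(\mathcal{G}, S^{(1)}) \cong (\mathcal{G}, S^{(2)})$, over the same graph $\mathcal{G}$
and the normalized Laplacian matrix of $\mathcal{G}$ has no multiple eigenvalues,
then $\mathbb{E}(f_{\text{MP-GNN}} \circ g_{Z_{LE}}(\mathcal{G}, S^{(1)})) = \mathbb{E}(f_{\text{MP-GNN}} \circ g_{Z_{LE}}(\mathcal{G}, S^{(2)}))$.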


Proof. The proof can be easily seen from the above arguments. 


Theorem 5.8 implies the potential of $f_{\text{MP-GNN}} \circ g_{Z_{LE}}$ to
distinguish different node sets from the same graph. Note that although
$f_{\text{MP-GNN}} \circ g_{Z_{LE}}$ achieves great representation power, it does
not always work very well for link prediction in practice (Dwivedi et al, 2020)
when compared with another model, SEAL (Zhang and Chen, 2018b) (compare their
performance on the COLLAB dataset in (Hu et al, 2020b)). SEAL is based on the
deterministic distance attributes that are introduced in the next subsection.
Whether a model is permutation invariant is only a weak characterization of how
well the model generalizes. Moreover, when the model is paired with node
positional embeddings, the dimension of the parameter space increases, which also
negatively impacts generalization. A comprehensive investigation of this
observation is left for future study.

In the next subsection, we will introduce deterministic node distance attributes, 
which provide a different angle to solve the above problem. Distance encoding has 
a solid mathematical foundation and provides the theoretical support for many em- 
pirically well-behaved models such as SEAL (Zhang and Chen, 2018b) and ID- 
GNN (You et al, 2021). 


5.4.3 Injecting Deterministic Distance Attributes 


In this subsection, we will introduce an approach that boosts the expressive power 
of MP-GNN by injecting deterministic distance attributes. 

The key motivation behind the deterministic distance attributes is as follows. In
Section 5.4.1, we have shown that MP-GNN is limited in its ability to measure the
distances between different nodes, to count cycles³, and to distinguish attributed
regular graphs. All of these limitations are essentially inherited from the 1-WL
test, which does not capture distance information between the nodes. If MP-GNN
is paired with some distance information, then the composite model must achieve
more expressive power. Then, the question is how to inject the distance information
properly.

There are two important pieces of intuition for designing such distance attributes.
First, the effective distance information is typically correlated with the task. For
example, consider a GRL example $(\mathcal{G}, S)$. If the task is node
classification ($|S| = 1$), the information of distance from this node to itself
(thus the cycles containing this node) is relevant because it measures the
information of the contextual structure. If the task is link prediction
($|S| = 2$), the information of distance between the two end nodes of the link is
relevant, as two nodes near each other in the network tend to be connected by a
link. For graph-level prediction ($S = V[\mathcal{G}]$), the information of
distances between any pairs of nodes could be relevant, as it can be viewed as a
group of link predictions. Second, besides the distance between the nodes in $S$,
the distance from $S$ to other nodes in $\mathcal{G}$ may also provide useful side
information. Both aspects inspire the design of distance attributes.

3 Cycles actually carry a special type of distance information, as they describe the
length of walks from one node to itself. If the distance from one node to itself is
measured not by the shortest path distance but by the returning probability of a
random walk, this distance already contains the cycle information.

There have been a few empirically successful GNN models that leverage
deterministic distance attributes, although their impact on the expressive power of
GNNs has not been characterized until very recently (Li et al, 2020e). For link
prediction, Li et al (2016a) first consider annotating the two end-nodes of the link
of interest. These two end-nodes are annotated with one-hot encodings and all other
nodes are annotated by zeros. Such annotations can be transformed into distance
information via GNN message passing. Again for link prediction, Zhang and Chen
(2018b) first sample an enclosing subgraph around the queried link and then
annotate each node in this subgraph with one-hot encodings of the shortest path
distances (SPDs) from this node to the two end-nodes of the link. Note that deciding
whether a node is in the enclosing subgraph around the queried link already gives a
distance attribute. Zhang and Chen (2019) use a similar idea to perform matrix
completion, which is a similar task to link prediction. For graph classification and
graph-level property prediction, Chen et al (2019a) and Maziarka et al (2020a)
adopt the SPDs between two nodes as edge attributes. These edge attributes can
also be used as the input of MSG (Eq. 11.45) in MP-GNN. You et al (2021) annotate
a node as 1 and other nodes as 0 to improve MP-GNN in node classification. As our
focus is on the theoretical characterization of the expressive power, we will not go
into detail about these empirically successful works. Interested readers are
referred to the relevant papers.
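
As a rough illustration of the SPD annotations described above, the following
Python sketch labels every node near a queried link with its pair of shortest
path distances to the two end-nodes. The function names, the `num_hops`
parameter and the toy path graph are assumptions of this sketch. It computes
distances on the full graph and skips the one-hot encoding step, so it is a
simplification and not the labeling procedure of Zhang and Chen (2018b).

```python
from collections import deque

def shortest_path_distances(adj, source):
    """BFS distances from `source`; unreachable nodes are absent from the result."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nbr in adj[node]:
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return dist

def annotate_enclosing_subgraph(adj, link, num_hops):
    """Label nodes within `num_hops` of either end-node with their SPD pair."""
    u, v = link
    dist_u = shortest_path_distances(adj, u)
    dist_v = shortest_path_distances(adj, v)
    labels = {}
    for node in adj:
        du, dv = dist_u.get(node), dist_v.get(node)
        near_u = du is not None and du <= num_hops
        near_v = dv is not None and dv <= num_hops
        if near_u or near_v:          # membership in the enclosing subgraph
            labels[node] = (du, dv)   # deterministic distance attribute
    return labels

# Toy path graph a - b - c - d - e, queried link (a, c), 1-hop enclosing subgraph.
adj = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c", "e"}, "e": {"d"}}
labels = annotate_enclosing_subgraph(adj, ("a", "c"), num_hops=1)
```

Node `e` is left out because it is more than one hop away from both end-nodes,
which reflects the remark that subgraph membership is itself a distance attribute.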


Remark 5.7. (Comparison between deterministic distance attributes and random at- 
tributes) Deterministic distance attributes have some advantages. First, as there is 
no randomness in the input attributes, the optimization procedure of the model con- 
tains less noise. Hence, the training procedure tends to converge much faster than 
the model with random attributes. The model evaluation performance contains much 
less noise too. Some empirical evaluation of the convergence of the model training 
with random attributes can be found in Abboud et al (2020). Second, a model based 
on deterministic distance attributes typically shows better generalization in practice 
than the one based on random attributes. Although theoretically a model is
permutation invariant when being trained based on sufficiently many examples with
random attributes (as discussed in Sec. 5.4.2), in practice, this could be hard to
achieve due to the high complexity.

Deterministic distance attributes have some disadvantages. First, models that are
paired with deterministic attributes may never achieve universal approximation,
unless the graph isomorphism problem is in P. However, random attributes
may be universal in the probabilistic sense (e.g., Theorem 5.6). Second,
deterministic distance attributes typically depend on the information $S$ in a GRL
example $(\mathcal{G}, S)$. This introduces an issue in computation: if there are
two GRL examples $(\mathcal{G}^{(1)}, S^{(1)})$ and $(\mathcal{G}^{(2)}, S^{(2)})$
sharing the same graph $\mathcal{G}$ but with different node sets of interest
$S^{(1)} \neq S^{(2)}$, they will be attached with different deterministic
distance attributes and hence GNNs have to make inference over them separately.
However, GNNs with random attributes can share intermediate node representations
$\{h_v^{(L)} \mid v \in V[\mathcal{G}]\}$ in Eq. 5.4 between the two GRL examples,
which saves intermediate computation.


5.4.3.1 Distance Encoding (Li et al, 2020e)


Suppose we aim to make a prediction for a GRL example $(\mathcal{G}, S)$. Li et al
(2020e) defined the distance encoding $\zeta(u|S)$ as an extra node attribute for
node $u \in V[\mathcal{G}]$.


Definition 5.10. For a GRL example $(\mathcal{G}, S)$ where $\mathcal{G} = (A, X)$,
the distance encoding $\zeta(u|S)$ for node $u$ is defined as follows


$$\zeta(u|S) = \sum_{v \in S} \text{MLP}(\zeta(u|v)) \tag{5.14}$$


where $\zeta(u|v)$ characterizes a certain distance between $u$ and $v$. We may choose


$$\zeta(u|v) = g(\ell_{uv}), \quad \ell_{uv} = (1, (W)_{uv}, (W^2)_{uv}, \dots, (W^k)_{uv}, \dots) \tag{5.15}$$


where $W = AD^{-1}$ is the random walk matrix and $g(\cdot)$ is a general function
that maps $\ell_{uv}$ to different types of distance measures.


Note that $\zeta(u|S)$ depends on the graph structure $\mathcal{G}$, which we omit
in our notation for simplicity. First, setting $g(\ell_{uv})$ as the first non-zero
position in $\ell_{uv}$ gives the shortest-path-distance (SPD) from $v$ to $u$.
Second, setting $g(\ell_{uv})$ as follows gives generalized PageRank scores
(Li et al, 2019f):


$$\zeta_{gpr}(u|v) = \sum_{k \geq 1} \gamma_k (W^k)_{uv} = \Big(\sum_{k \geq 1} \gamma_k W^k\Big)_{uv}, \quad \gamma_k \in \mathbb{R}, \text{ for all } k \in \mathbb{Z}_{>0}. \tag{5.16}$$


Different choices of $\{\gamma_k \mid k \in \mathbb{Z}_{>0}\}$ yield various distance
measures between $u$ and $v$.


Personalized PageRank scores (Jeh and Widom, 2003): $\gamma_k = \alpha^k$, $\alpha \in (0,1)$,
Heat-kernel PageRank scores (Chung, 2007): $\gamma_k = \beta^k e^{-\beta}/k!$, $\beta > 0$,
Inverse hitting time (Lovász et al, 1993): $\gamma_k = k$.


It is important to see that the above definition of distance encoding satisfies per- 
mutation invariance. 


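Lemma 5.4. For two isomorphic GRL examples
$(\mathcal{G}^{(1)}, S^{(1)}) \cong (\mathcal{G}^{(2)}, S^{(2)})$, if the isomorphism
$\pi$ maps $u \in V[\mathcal{G}^{(1)}]$ to $v \in V[\mathcal{G}^{(2)}]$, i.e.,
$\pi(u) = v$, their distance encodings are equal:
$\zeta(u|S^{(1)}) = \zeta(v|S^{(2)})$.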


Proof. The proof can be easily seen by the definition of distance encoding. 
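
The definitions above and the invariance claim of Lemma 5.4 can be checked on a
small example. The following pure-Python sketch is only illustrative: the
function names are hypothetical, the MLP in Eq. 5.14 is replaced by the identity
map, the landing probabilities are indexed from $W^0 = I$ and truncated at
`max_steps`, and the personalized PageRank weights use $\gamma_k = \alpha^k$.

```python
def random_walk_matrix(A):
    """W = A D^{-1}: column v of A is divided by the degree of node v."""
    n = len(A)
    deg = [sum(A[i][v] for i in range(n)) for v in range(n)]
    return [[A[u][v] / deg[v] for v in range(n)] for u in range(n)]

def matmul(P, Q):
    n = len(P)
    return [[sum(P[i][t] * Q[t][j] for t in range(n)) for j in range(n)]
            for i in range(n)]

def landing_probabilities(A, max_steps):
    """Return [W^0, W^1, ..., W^max_steps]; entry [k][u][v] is (W^k)_{uv}."""
    n = len(A)
    W = random_walk_matrix(A)
    power = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    powers = [power]
    for _ in range(max_steps):
        power = matmul(power, W)
        powers.append(power)
    return powers

def spd_from_landing(powers, u, v):
    """First step k with (W^k)_{uv} > 0, i.e. the shortest path distance."""
    for k, P in enumerate(powers):
        if P[u][v] > 0:
            return k
    return None  # not reachable within max_steps

def personalized_pagerank(powers, u, v, alpha):
    """Truncated generalized PageRank score with gamma_k = alpha^k, k >= 1."""
    return sum(alpha ** k * powers[k][u][v] for k in range(1, len(powers)))

def distance_encoding(powers, u, S, alpha=0.5):
    """zeta(u|S) summed over v in S, with the MLP replaced by the identity."""
    spd = sum(spd_from_landing(powers, u, v) for v in S)
    ppr = sum(personalized_pagerank(powers, u, v, alpha) for v in S)
    return (spd, ppr)

# Path graph 0 - 1 - 2 - 3 with S = {0}; encode node 3.
A = [[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]]
powers = landing_probabilities(A, max_steps=4)
enc = distance_encoding(powers, u=3, S=[0])

# Relabel nodes (new node i is old node perm[i]) and encode the same node again.
perm = [2, 0, 3, 1]
inv = [perm.index(x) for x in range(4)]
A_perm = [[A[perm[i]][perm[j]] for j in range(4)] for i in range(4)]
powers_perm = landing_probabilities(A_perm, max_steps=4)
enc_perm = distance_encoding(powers_perm, u=inv[3], S=[inv[0]])
```

Both encodings agree, as the lemma predicts, because the landing probabilities
depend only on the graph structure and not on how the nodes are numbered.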


Li et al (2020e) consider using distance encoding as extra node attributes.
Specifically, MP-GNN can be improved by setting $X_v = X_v \oplus \zeta(v|S)$, where
$\oplus$ is the concatenation. The obtained model is termed DE-GNN, denoted as $f_{DE}$.

Fig. 5.13: Distance encoding can be used to distinguish non-isomorphic
graph-structured examples. In the example of node classification, we consider
classifying nodes based on their roles in their contextual structures, termed
structural roles (Henderson et al, 2012). Nodes in $S_1$ and $S_2$ have different
structural roles. However, MP-GNN with two layers will confuse these two nodes,
while with distance encoding, DE-GNN may distinguish them. In the example of link
prediction, although the two nodes Lynx and Orca in $\mathcal{G}$ are isomorphic
(where we ignore the node identities), distance encoding on the node Seal allows us
to distinguish node pairs {Orca, Pelagic Fish} and {Lynx, Pelagic Fish}.


DE-GNN has been shown to be more powerful than MP-GNN. Recall that the
fundamental limit of MP-GNN is the 1-WL test for graph representation learning
problems (Theorem 5.2). Corollary 5.1 further indicates that attributed regular
graphs may not be distinguished by MP-GNN under certain scenarios. Li et al
(2020e) consider the scenario when the graphs are regular and do not have
attributes and proved that DE-GNN can distinguish two GRL examples with high
probability, which is formally stated in the following theorem.


Theorem 5.9. (Theorem 3.3 (Li et al, 2020e)) Consider two GRL examples
$(\mathcal{G}^{(1)}, S^{(1)})$ and $(\mathcal{G}^{(2)}, S^{(2)})$, where
$\mathcal{G}^{(1)}$ and $\mathcal{G}^{(2)}$ are two n-sized unattributed regular
graphs, and $|S^{(1)}| = |S^{(2)}|$ is a constant (independent of $n$). Suppose
$\mathcal{G}^{(1)}$ and $\mathcal{G}^{(2)}$ are uniformly independently sampled from
all n-sized r-regular graphs where $3 \le r < (2\log n)^{1/2}$. Then, for any small
constant $\epsilon > 0$, there exists a DE-GNN with certain weights within
$L \le \lceil (\frac{1}{2} + \epsilon) \frac{\log n}{\log(r-1)} \rceil$ layers that
can distinguish these two examples with high probability. Specifically, the outputs
$f_{DE}((\mathcal{G}^{(1)}, S^{(1)})) \neq f_{DE}((\mathcal{G}^{(2)}, S^{(2)}))$
with probability $1 - o(n^{-1})$. The specific form of DE, i.e., $g$ in Eq. 5.15,
can be simply chosen as the shortest path distance. The little-o notation here and
later is w.r.t. $n$.


Theorem 5.9 focuses on the node sets of unattributed regular graphs. We con- 
jecture that the statement can be generalized to attributed regular graphs as distinct 
attributes can only further improve the distinguishing power of a model. Moreover,
the assumption on regularity of graphs is also not crucial because the 1-WL test or
MP-GNN may transform all graphs, attributed or not, into attributed regular graphs 
with enough iterations (Arvind et al, 2019). 

Of course, DE-GNN cannot distinguish every pair of non-isomorphic GRL examples.
Li et al (2020e) characterize the limitation of DE-GNN. In particular, DE-GNN
cannot distinguish nodes of distance regular graphs with the same intersection
arrays, although DE-GNN may distinguish their edges (see Fig. 5.14 later). Li et al
(2020e) also generalize the above results to the case that leverages distance
attributes as edge attributes (to control message aggregation in MP-GNN).
Interested readers can check the details in their original paper.


5.4.3.2 Identity-aware GNN (You et al, 2021) 


As a concurrent work with DE-GNN, You et al (2021) studied a special type of
distance encoding to improve the node representations learnt by MP-GNN.
Specifically, when MP-GNN is adopted to compute the representation of node $v$,
You et al (2021) suggest attaching to each node $u$ in the graph an extra binary
attribute $\zeta_{ID}(u|\{v\})$ to indicate the identity of node $v$, where


$$\zeta_{ID}(u|\{v\}) = \begin{cases} 1 & \text{if } u = v, \\ 0 & \text{otherwise.} \end{cases} \tag{5.17}$$


MP-GNN that leverages $\zeta_{ID}(u|\{v\})$ is termed Identity-aware GNN (ID-GNN).
$\zeta_{ID}(u|\{v\})$ is a simple implementation of distance encoding (Eq. 5.14)
when the set $S$ contains only one node $v$. Although ID-GNN does not compute
distance measures as DE-GNN does, ID-GNN holds the same representation power as
DE-GNN for node classification, as the distance information from another node $u$
to the target node $v$ can be learnt by ID-GNN via the extra identity attribute.


Theorem 5.10. For two graph-structured examples $(\mathcal{G}^{(1)}, S^{(1)})$ and
$(\mathcal{G}^{(2)}, S^{(2)})$, where $|S^{(i)}| = 1$ for $i \in \{1,2\}$ and
$\mathcal{G}^{(i)}$ is unattributed, if DE-GNN can distinguish them with $L$
layers, then ID-GNN requires at most $2L$ layers to distinguish them.


Proof. ID-GNN needs the first L layers to propagate the identity attribute to capture 
distance information and the second L layers to let such information propagate back 
to finally be merged into the node representations. 
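
The first half of this argument can be made concrete with a small Python sketch.
It assumes a hand-picked max aggregation in place of learned message and update
functions, so it is not the ID-GNN architecture of You et al (2021), and all
function names are hypothetical. The layer at which the identity signal first
reaches a node equals that node's shortest path distance to the target.

```python
def identity_attribute(nodes, target):
    """zeta_ID(u|{v}) of Eq. 5.17: 1 for the target node, 0 elsewhere."""
    return {u: 1 if u == target else 0 for u in nodes}

def propagate_identity(adj, target, num_layers):
    """Run `num_layers` rounds of max-aggregation on the identity attribute.

    Returns, for each node, the index of the first layer at which the identity
    signal reached it (None if it was never reached).
    """
    reached = identity_attribute(adj, target)
    first_layer = {u: (0 if reached[u] else None) for u in adj}
    for layer in range(1, num_layers + 1):
        # AGGREGATE = max over neighbors, UPDATE = max with the node's own state.
        new_reached = {u: max([reached[u]] + [reached[w] for w in adj[u]])
                       for u in adj}
        for u in adj:
            if new_reached[u] and not reached[u]:
                first_layer[u] = layer
        reached = new_reached
    return first_layer

# 6-cycle with target node 0.
hexagon = {i: {(i - 1) % 6, (i + 1) % 6} for i in range(6)}
after_three = propagate_identity(hexagon, target=0, num_layers=3)
after_two = propagate_identity(hexagon, target=0, num_layers=2)
```

With two layers the node opposite to the target is still unreached, which shows
why the number of layers must cover the distances that DE-GNN would encode directly.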


[Figure 5.14 panels: "ID-GNN for |S|=2" and "DE-GNN for |S|=2"; legend: DE = {0, 1},
DE = {1, 1}, DE = {1, 2}, DE = {2, 2}.]

Fig. 5.14: ID-GNN vs. DE-GNN when making predictions for a pair of nodes. The two
graphs are the Shrikhande graph (left) and the 4 x 4 Rook's graph (right). ID-GNN
(black nodes are attached with identities) cannot distinguish node pairs {a,b} and
{c,d}. DE-GNN may learn distinct representations of {a,b} and {c,d}. In these two
graphs, each node is colored with its DE, that is, a set of SPDs to either node in
the target node sets {a,b} or {c,d} (Eq. 5.14). Note that the neighbors of nodes
with DE = {1,1} (dashed boxes) are enclosed by red ellipses, which shows that the
neighbors of these two nodes have different DEs. Hence, after one layer of DE-GNN,
the intermediate representations of nodes with DE = {1,1} are different between
these two graphs. Using another layer, DE-GNN can distinguish the representations
of {a,b} and {c,d}.


Although ID-GNN adopts a specific type of DE to learn node representations,
ID-GNN was also used to perform graph-level prediction (You et al, 2021).
Specifically, for every node $v$ in the graph $\mathcal{G}$, ID-GNN attaches 1 to
this node and 0's to other nodes and computes the node representation $h_v$.
Iterating over all the nodes, ID-GNN collects all node representations
$\{h_v \mid v \in V[\mathcal{G}]\}$. Then, by following Eq. 5.4 ($S$ is the entire
node set $V[\mathcal{G}]$ here), ID-GNN can aggregate the node representations of
all the nodes and further make graph-level predictions. Actually, combining
the statement of Theorem 5.9 and the union bound, Li et al (2020e) indicate the
expressive power of the above procedure for the entire graph classification
problem, summarized in the following corollary.


Corollary 5.2. (Reproduced from Corollary 3.4 (Li et al, 2020e)) Consider two
GRL examples $\mathcal{G}^{(1)}$ and $\mathcal{G}^{(2)}$. Suppose $\mathcal{G}^{(1)}$
and $\mathcal{G}^{(2)}$ are uniformly independently sampled from all n-sized
unattributed r-regular graphs where $3 \le r < (2\log n)^{1/2}$. Then, ID-GNN with a
sufficient number of layers can distinguish these two graphs with probability
$1 - o(1)$. The little-o notation here and later is w.r.t. $n$.


ID-GNN can be viewed as the simplest version of DE-GNN that achieves the 
same expressive power for node-level prediction. However, when the prediction 
tasks contain two nodes, i.e., node-pair-level prediction, ID-GNN will be less pow- 
erful than DE-GNN. 

To make a prediction for a GRL example $(\mathcal{G}, S)$ where $|S| = 2$, ID-GNN
can adopt two different approaches. First, ID-GNN can attach the extra identity
attributes to the two nodes in $S$ separately, learn their representations
separately and combine these two representations to make the final prediction.
However, this approach cannot capture the distance information between the two
nodes in $S$. Instead, ID-GNN uses an alternative approach. ID-GNN attaches the
extra identity attribute to only one of the nodes in $S$ and performs message
passing. Then, after a sufficient number of layers where the extra node identity is
propagated from one node to another in $S$, the distance information between these
two nodes can be captured. Finally, ID-GNN makes its prediction based on the two
node representations in $S$. Note that although the second approach captures the
distance information between the two nodes in $S$, it is still less powerful than
DE-GNN. One example is shown in Fig. 5.14.

Up to this point, we have mostly focused on the message passing framework of
GNNs, which leverages the sparsity of real-world graphs. In the next subsection, we
remove the need for sparsity and discuss higher-order GNNs. These GNNs
essentially mimic higher-dimensional WL tests and achieve more expressive power.


5.4.4 Higher-order Graph Neural Networks 


The final collection of techniques for building GNNs that overcome the limitation
of the 1-WL test is related to higher-dimensional WL tests. In this subsection, for
notational simplicity, we focus only on graph-level prediction learning problems,
where higher-order GNNs are mostly used.

The family of WL tests forms a hierarchy for the graph isomorphism problem
(Cai et al, 1992). There are different definitions of the higher-dimensional WL
tests. We follow the terminology adopted in Maron et al (2019a) and introduce two
types of WL tests: the k-folklore WL (k-FWL) test and the k-WL test.


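Recall $\mathcal{G}^{(i)} = (A^{(i)}, X^{(i)})$, $i \in \{1,2\}$. For both
$\mathcal{G}^{(i)}$, $i \in \{1,2\}$, do the following steps.


1. For each k-tuple of nodes $V_j = (v_{j_1}, v_{j_2}, \dots, v_{j_k}) \in V^k$,
$j \in [n]^k$, we initialize $V_j$ with a color denoted by $C_j^{(0)}$. These colors
satisfy the condition that for two k-tuples, say $V_j$ and $V_{j'}$, $C_j^{(0)}$ and
$C_{j'}^{(0)}$ are the same if and only if: (1) $X_{v_{j_a}} = X_{v_{j'_a}}$;
(2) $v_{j_a} = v_{j_b} \Leftrightarrow v_{j'_a} = v_{j'_b}$; and
(3) $(v_{j_a}, v_{j_b}) \in E \Leftrightarrow (v_{j'_a}, v_{j'_b}) \in E$, for all
$a, b \in [k]$.

2. k-FWL: For each k-tuple $V_j$ and $u \in V$, define $N_{k\text{-FWL}}(V_j; u)$ as a
k-tuple of k-tuples, such that
$N_{k\text{-FWL}}(V_j; u) = ((u, v_{j_2}, \dots, v_{j_k}), (v_{j_1}, u, \dots, v_{j_k}), \dots, (v_{j_1}, v_{j_2}, \dots, u))$.
Then the color of $V_j$ can be updated via the following mapping.


Update colors: $C_j^{(l+1)} \leftarrow \text{HASH}\big(C_j^{(l)}, \{(C_{j'}^{(l)} \mid V_{j'} \in N_{k\text{-FWL}}(V_j; u))\}_{u \in V}\big)$ (5.18)


k-WL: For each k-tuple $V_j$ and $u \in V$, define $N_{k\text{-WL}}(V_j; u)$ as a set
of k-tuples such that
$N_{k\text{-WL}}(V_j; u) = \{(u, v_{j_2}, \dots, v_{j_k}), (v_{j_1}, u, \dots, v_{j_k}), \dots, (v_{j_1}, v_{j_2}, \dots, u)\}$.
Then, the color of $V_j$ can be updated via the following mapping.


Update colors: $C_j^{(l+1)} \leftarrow \text{HASH}\big(C_j^{(l)}, \cup_{u \in V} \{C_{j'}^{(l)} \mid V_{j'} \in N_{k\text{-WL}}(V_j; u)\}\big)$, (5.19)


where the HASH operations in both cases guarantee an injective mapping with
different inputs yielding different outputs.

3. For each step $l$, $\{C_j^{(l)}\}_{j \in [n]^k}$ is a multi-set. If such
multi-sets of the two graphs are not equal, return
$\mathcal{G}^{(1)} \not\cong \mathcal{G}^{(2)}$. Otherwise, go to step 2.


Similar to the 1-WL test, if the k-(F)WL test returns
$\mathcal{G}^{(1)} \not\cong \mathcal{G}^{(2)}$, then it follows that
$\mathcal{G}^{(1)}$ and $\mathcal{G}^{(2)}$ are not isomorphic. However, the reverse
is not true.


Fig. 5.15: Use k-FWL and k-WL to distinguish $\mathcal{G}^{(1)}$ and $\mathcal{G}^{(2)}$.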
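
To see the procedure of Fig. 5.15 at work, the following Python sketch runs the
1-WL test and the 2-FWL update of Eq. 5.18 on two unattributed graphs that are
both 2-regular on six nodes: two disjoint triangles and a 6-cycle. The function
names are hypothetical and the injective HASH is realized by a lookup table
shared across both graphs, which is one possible choice among many.

```python
from collections import Counter
from itertools import product

def relabel(signature_maps):
    """Shared injective HASH: each distinct signature gets one integer color."""
    table = {}
    return [{key: table.setdefault(sig, len(table)) for key, sig in sigs.items()}
            for sigs in signature_maps]

def wl1_histograms(adjs, num_rounds):
    """Run 1-WL jointly on several graphs; return one color histogram per graph."""
    colors = [{v: 0 for v in adj} for adj in adjs]
    for _ in range(num_rounds):
        sigs = [{v: (col[v], tuple(sorted(col[u] for u in adj[v]))) for v in adj}
                for adj, col in zip(adjs, colors)]
        colors = relabel(sigs)
    return [Counter(col.values()) for col in colors]

def fwl2_histograms(adjs, num_rounds):
    """Run 2-FWL (Eq. 5.18 with k = 2) jointly on several unattributed graphs."""
    # Initial color of a pair (a, b): (is it a repeated node?, is it an edge?)
    init = [{(a, b): (a == b, b in adj[a]) for a, b in product(adj, repeat=2)}
            for adj in adjs]
    colors = relabel(init)
    for _ in range(num_rounds):
        sigs = []
        for adj, col in zip(adjs, colors):
            sigs.append({
                # N_{2-FWL}((a, b); u) = ((u, b), (a, u)), collected over all u
                (a, b): (col[(a, b)],
                         tuple(sorted((col[(u, b)], col[(a, u)]) for u in adj)))
                for a, b in product(adj, repeat=2)
            })
        colors = relabel(sigs)
    return [Counter(col.values()) for col in colors]

two_triangles = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}, 3: {4, 5}, 4: {3, 5}, 5: {3, 4}}
hexagon = {i: {(i - 1) % 6, (i + 1) % 6} for i in range(6)}

wl1 = wl1_histograms([two_triangles, hexagon], num_rounds=3)
fwl2 = fwl2_histograms([two_triangles, hexagon], num_rounds=1)
```

The 1-WL histograms coincide because every node sees two identically colored
neighbors, whereas a single 2-FWL round already separates the graphs: an edge of
a triangle has a node `u` adjacent to both of its end-nodes, and no edge of the
6-cycle does.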


The key idea of these higher-dim WL tests is to color every k-tuple of nodes in
the graphs and update these colors by aggregating the colors from other k-tuples that
share k — 1 nodes. The procedures of the k-FWL test and the k-WL test are shown in
Fig. 5.15. Note that they perform aggregation differently, and as such, have different
power to distinguish non-isomorphic graphs. These two types of tests form a nested
hierarchy, as summarized in the following theorem.


Theorem 5.11. (Cai et al, 1992; Grohe and Otto, 2015; Grohe, 2017)


1. The k-FWL test and the (k+1)-WL test have the same discriminatory power, for
k>1.

2. The 1-FWL test, the 2-WL test and the 1-WL test have the same discriminatory
power.

3. There are some graphs that the (k+1)-WL test can distinguish while the k-WL
test cannot, for k > 2.


Because of Theorem 5.11, GNNs that are able to capture the power of these 
higher-dim WL tests can be strictly more powerful than the 1-WL test. Therefore, 
higher-order GNNs have the potential to learn even more complex functions than 
MP-GNN. 

However, the drawback of these GNNs is their computational complexity. By
the definition of higher-order WL tests, the colors of all k-tuples of nodes need to
be tracked. Correspondingly, higher-order GNNs that mimic higher-order WL tests
need to associate each k-tuple with a vector representation. Therefore, their memory
complexity is at least $\Omega(|V|^k)$, where $|V|$ is the number of nodes in the
graph. The computational complexity is at least $\Omega(|V|^{k+1})$, which makes
these higher-order GNNs prohibitively expensive for large-scale graphs.


5.4.4.1 k-WL-induced GNNs (Morris et al, 2019) 


Morris et al (2019) first proposed k-GNN by following the k-WL test. Specifically,
k-GNN associates each k-tuple of nodes, denoted by $V_j \in V^k$, with a vector
representation that is initialized as $h_j^{(0)}$. In order to save memory, k-GNN
only considers k-tuples that contain k different nodes and ignores the order of
these nodes. Therefore, each k-tuple reduces to a set of k nodes. With some
modification of notation in this subsection, let $V_j$ denote this set of k
different nodes. The initial representation of $V_j$, $h_j^{(0)}$, is chosen as a
one-hot encoding such that $h_j^{(0)} = h_{j'}^{(0)}$ if and only if the subgraphs
induced by $V_j$ and $V_{j'}$ are isomorphic.


Then, k-GNN follows the following update procedure of these representations:


$$h_j^{(l+1)} = \text{MLP}\Big(h_j^{(l)} \oplus \sum_{V_{j'} \in N_{k\text{-GNN}}(V_j)} h_{j'}^{(l)}\Big), \quad \forall \text{ k-sized node sets } V_j, \tag{5.20}$$

where $N_{k\text{-GNN}}(V_j) = \{V_{j'} \mid |V_{j'} \cap V_j| = k - 1\}$. Note that
$N_{k\text{-GNN}}(V_j)$ defines the neighbors of $V_j$ differently than
$N_{k\text{-WL}}$ (see Eq. 5.19), because $V_j$ is now a k-sized node set instead of
a k-tuple.

Eq. 5.20 has time complexity at least $O(|V|^k)$ as the size of
$N_{k\text{-GNN}}(V_j)$ is $O(|V|^k)$. Morris et al (2019) also consider using a
local neighborhood of $V_j$ instead of $N_{k\text{-GNN}}(V_j)$. This local
neighborhood only includes $V_{j'} \in N_{k\text{-GNN}}(V_j)$ such that the node in
$V_{j'} \setminus V_j$ is connected to at least one node in $V_j$. Morris et al
(2020b) demonstrated that a variant of this local version of k-GNN may be as
powerful as the k-WL test, although a deeper architecture with more layers is
needed to match the expressive power.

k-GNN is at most as powerful as the k-WL test. To be more expressive than
MP-GNN, $k \geq 3$ is needed. Therefore, the memory complexity is at least
$\Omega(|V|^3)$. Subsequently, the computational complexity of k-GNN, even for its
local version, is at least $\Omega(|V|^3)$ per layer.
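
The global and local neighborhoods of a k-sized node set can be enumerated
directly, which makes the counts behind these complexity statements tangible.
The helper names below are hypothetical and the 6-cycle is only a toy input.

```python
from itertools import combinations
from math import comb

def k_node_sets(nodes, k):
    """All k-sized node sets, i.e. the objects that k-GNN attaches vectors to."""
    return [frozenset(c) for c in combinations(sorted(nodes), k)]

def global_neighbors(node_set, nodes):
    """All k-sets that share exactly k-1 nodes with `node_set`."""
    return [(node_set - {old}) | {new}
            for old in node_set for new in nodes if new not in node_set]

def local_neighbors(node_set, adj):
    """Global neighbors whose new node is adjacent to some node of `node_set`."""
    return [(node_set - {old}) | {new}
            for old in node_set for new in adj
            if new not in node_set and adj[new] & node_set]

hexagon = {i: {(i - 1) % 6, (i + 1) % 6} for i in range(6)}
sets_3 = k_node_sets(hexagon, 3)
start = frozenset({0, 1, 2})
glob = global_neighbors(start, hexagon)
loc = local_neighbors(start, hexagon)
```

For n = 6 and k = 3 there are 20 node sets, the set {0, 1, 2} has 9 global
neighbors and only 6 local ones. The local variant shrinks each neighborhood, but
the number of node sets that must be stored is unchanged.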


5.4.4.2 Invariant and equivariant GNNs (Maron et al, 2018, 2019b) 


To build higher-order GNNs, every k-tuple needs to be associated with a vector
representation. Therefore, regardless of whether a local or a global neighborhood
aggregation is adopted (Eq. 5.20), the benefit of reducing the computation by
leveraging the sparse graph structure is limited, as it cannot reduce the dominant
term $\Omega(|V|^k)$. Moreover, to handle a sparse graph structure, these
higher-order GNNs also need random memory access, which introduces additional
computational overhead. Therefore, a line of research into building higher-order
GNNs totally ignores graph sparsity. Graphs are viewed as tensors and NNs take
these tensors as input. The NNs are designed to be invariant to the order of tensor
indices.

Many approaches (Maron et al, 2018, 2019a,b; Chen et al, 2019f; Keriven and 
Peyré, 2019; Vignac et al, 2020a; Azizian and Lelarge, 2020) adopt this formulation 
to build GNNs and analyze their expressive power. 


Each k-tuple $V_j \in V^k$ is associated with a vector representation $h_j^{(0)}$.
We assume that $h_j^{(0)} \in \mathbb{R}$ for simplicity. By concatenating the
k-tuples' representations together, we obtain a k-order tensor:


$$H \in \mathbb{R}^{\otimes_k |V|}, \quad \text{where } \mathbb{R}^{\otimes_k |V|} = \mathbb{R}^{\overbrace{|V| \times \cdots \times |V|}^{k \text{ times}}}.$$


Maron et al (2018) investigate linear permutation invariant and equivariant
mappings defined on $\mathbb{R}^{\otimes_k |V|}$.


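Definition 5.11. Given a bijective mapping $\pi: V \to V$ and
$H \in \mathbb{R}^{\otimes_k |V|}$, define $\pi(H) := H'$, where
$H'_{(\pi(v_1), \pi(v_2), \dots, \pi(v_k))} = H_{(v_1, v_2, \dots, v_k)}$ for all
k-tuples $(v_1, v_2, \dots, v_k) \in V^k$.


Definition 5.12. A mapping $g: \mathbb{R}^{\otimes_k |V|} \to \mathbb{R}$ is called
invariant, if for any bijective mapping $\pi: V \to V$ and
$H \in \mathbb{R}^{\otimes_k |V|}$, $g(H) = g(\pi(H))$.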


Definition 5.13. A mapping
$g: \mathbb{R}^{\otimes_k |V|} \to \mathbb{R}^{\otimes_{k'} |V|}$ is called
equivariant, if for any bijective mapping $\pi: V \to V$ and
$H \in \mathbb{R}^{\otimes_k |V|}$, $\pi(g(H)) = g(\pi(H))$.

Maron et al (2018) showed that the number of bases needed to represent all
possible linear invariant mappings from $\mathbb{R}^{\otimes_k |V|} \to \mathbb{R}$
is $b(k)$, where $b(k)$ is the k-th Bell number. Additionally, the number of bases
needed to represent all possible linear equivariant mappings from
$\mathbb{R}^{\otimes_k |V|} \to \mathbb{R}^{\otimes_{k'} |V|}$ is $b(k + k')$. To
better understand this observation, consider the invariant case with $k = 1$. In
this case, the linear invariant mapping $g: \mathbb{R}^{|V|} \to \mathbb{R}$ is
essentially an invariant pooling (Def. 5.7). As $b(1) = 1$, the linear invariant
mapping $g: \mathbb{R}^{|V|} \to \mathbb{R}$ only holds one single base, the sum
pooling, i.e., $g$ follows the form $g(a) = c\langle \mathbf{1}, a \rangle$, where
$c$ is a parameter to be learned. Consider the equivariant case, where $k = 1$ and
$k' = 1$. As $b(2) = 2$, the linear equivariant mapping
$g: \mathbb{R}^{|V|} \to \mathbb{R}^{|V|}$ holds two bases, i.e., $g$ has the form
$g(a) = (c_1 I + c_2 \mathbf{1}\mathbf{1}^T) a$, where $c_1, c_2$ are parameters to
be learned.
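
The two small cases above can be verified numerically. The following Python
sketch, with hypothetical function names, computes Bell numbers through the Bell
triangle and checks that the two-parameter map
$g(a) = (c_1 I + c_2 \mathbf{1}\mathbf{1}^T) a$ commutes with a permutation of the
node order, while the one-parameter sum pooling is unchanged by it.

```python
def bell_number(m):
    """m-th Bell number (m >= 1) computed with the Bell triangle."""
    row = [1]
    for _ in range(m - 1):
        new_row = [row[-1]]          # each row starts with the previous row's end
        for value in row:
            new_row.append(new_row[-1] + value)
        row = new_row
    return row[-1]

def linear_invariant(a, c):
    """g(a) = c * <1, a>: sum pooling with one learnable scale."""
    return c * sum(a)

def linear_equivariant(a, c1, c2):
    """g(a) = (c1 * I + c2 * 1 1^T) a for a vector a indexed by nodes."""
    total = sum(a)
    return [c1 * x + c2 * total for x in a]

def permute(a, perm):
    """Reorder the entries of a according to a node permutation."""
    return [a[p] for p in perm]

a = [1.0, 2.0, 4.0]
perm = [2, 0, 1]
out_then_perm = permute(linear_equivariant(a, 0.5, 0.25), perm)
perm_then_out = linear_equivariant(permute(a, perm), 0.5, 0.25)
```

The same Bell numbers give the basis counts used later in this subsection, for
instance `bell_number(6)` for equivariant layers between 3-order tensors.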

Based on the above observations, GNNs can be built by composing these linear
invariant/equivariant mappings. Learning can be performed via learning the
coefficients of the above bases. Towards this end, Maron et al (2018, 2019a) have
proposed using these linear invariant/equivariant mappings to build GNNs:


$$f_{k\text{-inv}} = g_{inv} \circ g_{equ}^{(L)} \circ \sigma \circ g_{equ}^{(L-1)} \circ \sigma \circ \cdots \circ \sigma \circ g_{equ}^{(1)}, \tag{5.21}$$


where $g_{inv}$ is a linear invariant layer
$\mathbb{R}^{\otimes_k |V|} \to \mathbb{R}$, $g_{equ}^{(l)}$, $l \in [L]$, are linear
equivariant layers from $\mathbb{R}^{\otimes_k |V|} \to \mathbb{R}^{\otimes_k |V|}$,
and $\sigma$ is an element-wise non-linear activation function. It can be shown that
$f_{k\text{-inv}}$ is an invariant mapping. Maron et al (2018); Azizian and Lelarge
(2020) proved that the connection of $f_{k\text{-inv}}$ to the k-WL test can be
summarized with the following theorem.


Theorem 5.12. (Reproduced from (Maron et al, 2018; Azizian and Lelarge, 2020))
For two non-isomorphic graphs $\mathcal{G}^{(1)} \not\cong \mathcal{G}^{(2)}$, if the
k-WL test can distinguish them, then there exists $f_{k\text{-inv}}$ that can
distinguish them.


Maron et al (2019b); Keriven and Peyré (2019) also studied whether the models
$f_{k\text{-inv}}$ may universally approximate any permutation invariant function.
However, they were pessimistic in their conclusion since this would require
high-order tensors, $k = \Omega(n)$, which can hardly be implemented in practice
(Maron et al, 2019b).

Similar to k-GNN, $f_{k\text{-inv}}$ is also at most as powerful as the k-WL test. To
be more expressive than MP-GNN, $f_{k\text{-inv}}$ should use at least $k = 3$.
Therefore, the memory complexity is at least $\Omega(|V|^3)$. Then, the number of
bases of the linear equivariant layer is $b(6) = 203$. Therefore, the computation at
each layer proceeds as follows: (1) a tensor in $\mathbb{R}^{\otimes_3 |V|}$ times
$b(6)$ many tensors in $\mathbb{R}^{\otimes_6 |V|}$ gives $b(6)$ many tensors in
$\mathbb{R}^{\otimes_3 |V|}$; (2) these tensors get summed into a single tensor in
$\mathbb{R}^{\otimes_3 |V|}$.


5.4.4.3 k-FWL-induced GNNs (Maron et al, 2019a; Chen et al, 2019f) 


The higher-order GNNs in the previous two subsections match the expressive power
of the k-WL test. According to Theorem 5.11, the k-FWL test holds the same power
as the (k+1)-WL test, which is strictly more powerful than the k-WL test for k > 2,
while the k-FWL test only needs to track the representations of k-tuples. Therefore,
if GNNs can mimic the k-FWL test, they may hold a similar memory cost as the GNNs
introduced in the previous two subsections while being more expressive. Maron et al
(2019a); Chen et al (2019f) proposed PPGN and Ring-GNN respectively to match
the k-FWL test.

The key difference between the k-FWL test and the k-WL test is how they leverage
the neighbors of a k-tuple $V_j$. Note that $N_{k\text{-FWL}}(V_j; u)$ in Eq. 5.18
groups the neighboring tuples of $V_j$ into a higher-level tuple, while
$N_{k\text{-WL}}(V_j; u)$ skips grouping them due to the set union operation in
Eq. 5.19. This yields the key mechanism in the GNN design to match the k-FWL test:
implement the aggregating procedure in the k-FWL test of Eq. 5.18 via a product-sum
procedure. Suppose the representation for $V_j$ is $h_j \in \mathbb{R}$. We may
design the aggregation of
$\{(h_{j'} \mid V_{j'} \in N_{k\text{-FWL}}(V_j; u))\}_{u \in V}$ as


$$\sum_{u \in V} \prod_{V_{j'} \in N_{k\text{-FWL}}(V_j; u)} h_{j'}.$$


If we combine all these representations into a tensor
$H \in \mathbb{R}^{\otimes_k |V|}$, the above operation can essentially be
represented as a tensor operation, i.e., define


$$H' = \sum_{u \in V} H_{u,\cdot,\dots,\cdot} \odot H_{\cdot,u,\dots,\cdot} \odot \cdots \odot H_{\cdot,\cdot,\dots,u}, \quad \text{where}$$

$$H'_{v_{j_1}, v_{j_2}, \dots, v_{j_k}} = \sum_{u \in V} H_{u, v_{j_2}, \dots, v_{j_k}} \, H_{v_{j_1}, u, \dots, v_{j_k}} \cdots H_{v_{j_1}, v_{j_2}, \dots, u}.$$


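Based on the above observation, Maron et al (2019a) built PPGN as follows.
First, for all $V_j \in V^k$, initialize $h_j^{(0)} \in \mathbb{R}$ such that
$h_j^{(0)} = h_{j'}^{(0)}$ if and only if: (1) $X_{v_{j_a}} = X_{v_{j'_a}}$;
(2) $v_{j_a} = v_{j_b} \Leftrightarrow v_{j'_a} = v_{j'_b}$; and
(3) $(v_{j_a}, v_{j_b}) \in E \Leftrightarrow (v_{j'_a}, v_{j'_b}) \in E$, for all
$a, b \in [k]$. Then, combine $h_j^{(0)}$ into a tensor
$H^{(0)} \in \mathbb{R}^{\otimes_k |V|}$. Perform the updating procedure for
$l = 0, 1, \dots, L-1$:


$$H^{(l+1)} = \tilde{H}^{(l,0)} \oplus \Big[\sum_{u \in V} \tilde{H}^{(l,1)}_{u,\cdot,\dots,\cdot} \odot \tilde{H}^{(l,2)}_{\cdot,u,\dots,\cdot} \odot \cdots \odot \tilde{H}^{(l,k)}_{\cdot,\cdot,\dots,u}\Big], \quad \text{where } \tilde{H}^{(l,i)} = \text{MLP}^{(l,i)}(H^{(l)}). \tag{5.22}$$


Here, MLPs are imposed at the last dimension of these tensors. MLPs with different
superscripts have different parameters. Finally, perform a READOUT
$\sum_{V_j \in V^k} h_j^{(L)}$ to obtain the graph representation.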
Maron et al (2019a) proved that PPGN, when k = 2, can match the power of the 
2-FWL test. Azizian and Lelarge (2020) generalized this result to an arbitrary k. 


Theorem 5.13. (From (Azizian and Lelarge, 2020)) For two non-isomorphic
graphs $\mathcal{G}^{(1)}$ and $\mathcal{G}^{(2)}$, if the k-FWL test can distinguish them, then there exists a PPGN
that can distinguish them.




To be more powerful than the 1-WL test, PPGN only needs to set k = 2 and hence
the memory complexity is just $O(|\mathcal{V}|^2)$. Regarding the computation, the
product-sum-type aggregation of PPGN is indeed more complex than the one in Sec. 5.4.4.2.
However, when k = 2, Eq. (5.22) reduces to the product of two matrices,
which can be efficiently computed in parallel computing units.
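
The reduction to a matrix product for k = 2 can be checked numerically. The following is a minimal sketch in plain Python, assuming scalar tuple representations and omitting the MLPs of Eq. (5.22): the inputs `h1` and `h2` merely stand in for two MLP outputs, and the function names are illustrative rather than taken from any PPGN code.

```python
def matmul(a, b):
    """Matrix product of two n x n matrices stored as lists of lists."""
    n = len(a)
    return [[sum(a[i][u] * b[u][j] for u in range(n)) for j in range(n)]
            for i in range(n)]


def product_sum_k2(h1, h2):
    """Product-sum aggregation for k = 2, written as an explicit loop.

    For the 2-tuple (v1, v2) and a node u, the two neighboring tuples are
    (u, v2) and (v1, u). Their representations are multiplied and the
    products are summed over all nodes u.
    """
    n = len(h1)
    return [[sum(h1[u][v2] * h2[v1][u] for u in range(n)) for v2 in range(n)]
            for v1 in range(n)]


# Toy inputs on 3 nodes; h1 and h2 play the role of two MLP outputs.
h1 = [[1.0, 2.0, 0.0],
      [0.0, 1.0, 3.0],
      [4.0, 0.0, 1.0]]
h2 = [[0.0, 1.0, 1.0],
      [2.0, 0.0, 1.0],
      [1.0, 1.0, 0.0]]

loop_result = product_sum_k2(h1, h2)
matrix_result = matmul(h2, h1)  # the same quantity as one matrix product
print(loop_result == matrix_result)
```

The explicit loop and the single matrix product give the same tensor, which is why the k = 2 case maps well onto hardware optimized for matrix multiplication.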


5.5 Summary 


Graph neural networks have recently achieved unprecedented success across many 
domains due to their great expressive power to learn complex functions defined 
over graphs and relational data. In this chapter, we provided a systematic study of 
the expressive power of GNNs by giving an overview of recent research results in 
this field. 

We first established that the message passing GNN is at most as powerful as the
1-WL test in distinguishing non-isomorphic graphs. The key condition that guarantees
matching this limit is an injective updating function of node representations.
Next, we discussed techniques that have been proposed to build more powerful 
GNNs. One approach to make message passing GNNs more expressive is to pair 
the input graphs with extra attributes. In particular, we discussed two types of extra 
attributes — random attributes and deterministic distance attributes. Injecting ran- 
dom attributes allows GNNs to distinguish any non-isomorphic graphs, though a 
large amount of data augmentation is required to make GNNs approximately invari- 
ant. Meanwhile, injecting deterministic distance attributes does not require the same 
data augmentation, but the expressive power of the resulting GNNs still holds certain 
limitations. Mimicking higher-dimensional WL tests is another way to build more 
powerful GNNs. These approaches do not track node representations. Instead, they 
update the representation of every k-tuple of nodes (k ≥ 2). Overall, the message
passing GNN is powerful but holds some limits in its expressive power. Different 
techniques make GNNs overcome these limits to a different extent while incurring 
different types of computational costs. 

We would like to list some additional research results on the expressive power 
of GNNs that we were not able to cover earlier due to space limitations. Barceló 
et al (2019) study the expressive power of GNNs to represent Boolean classifiers, 
which is useful to understand how GNNs represent knowledge and logic. Vignac 
et al (2020a) propose a structural message passing framework for GNNs, where a 
matrix instead of a vector is adopted as the node representation to make GNN more 
expressive. Balcilar et al (2021) studied the expressivity of GNNs via the spectral 
analysis of GNN-based graph signal transformations. Chen et al (2020k) study the
effect of the non-linearity in the message passing procedure of GNNs on their expressive
power, which complements our understanding of many works that suggest a
linear message passing procedure (Wu et al, 2019a; Klicpera et al, 2019a; Chien
et al, 2021).

Theoretical characterization of GNNs is an important research direction, where
the analysis of expressive power is only one of its aspects, perhaps the best-studied
up to this point. Machine learning models involve two further fundamental blocks, training and
generalization. However, only a few research works have analyzed them (Garg et al,
2020; Liao et al, 2021; Xu et al, 2020c). The authors suggest that future research
on building more expressive GNNs always take these two blocks into account.
A related, significant question is how to build more expressive GNNs with only a
limited depth and width⁴. Note that limiting the model depth and width yields the
potential of more efficient GNN training and better generalization. To conclude this
chapter, let us quote Sir Winston Churchill: "Now this is not the end. It is not even the
beginning of the end. But it is, perhaps, the end of the beginning." We have strong
confidence that the machine learning community will put more effort into theory for
GNNs in the future to match their success and overcome the difficulties they encounter
in real-world applications.

Editor’s Notes: The theoretical analysis of expressive power reveals how
the architecture of GNNs works and where its advantages come from. Hence it
helps readers understand the great success of GNNs in fundamental
graph learning tasks, e.g. link prediction (chapter 10) and graph matching
(chapter 13), and in various downstream tasks, e.g. recommender systems (chapter
19) and natural language processing (chapter 21), as well as its relation
to other characterizations of GNNs, e.g. scalability (chapter 6) and robustness
(chapter 8). These theories are also likely to motivate
the study of better GNN models that can break through unsolved challenges
in existing problems, such as graph transformation (chapter 12) and
drug discovery (chapter 24).


4 Loukas (2020) measures the required depth and width of GNNs by viewing them as distributed 
algorithms, which does not assume permutation invariance. Instead, here we are talking about the 
expressive power that refers to the capability of learning permutation invariant functions. 

Abstract Over the past decade, Graph Neural Networks have achieved remarkable
success in modeling complex graph data. Nowadays, graph data is growing
exponentially in both magnitude and volume, e.g., a social network can consist
of billions of users and relationships. This raises a crucial question:
how can Graph Neural Networks be scaled properly? Two major challenges
remain when scaling the original implementation of GNNs to large
graphs. First, most GNN models compute over the entire adjacency matrix
and the node embeddings of the whole graph, which demands huge memory space. Second,
training a GNN requires recursively updating each node in the graph, which becomes
infeasible and ineffective for large graphs. Current studies propose to tackle these
obstacles mainly with three sampling paradigms: node-wise sampling, which is
executed based on the target nodes in the graph; layer-wise sampling, which is
implemented on the convolutional layers; and graph-wise sampling, which constructs
sub-graphs for the model inference. In this chapter, we introduce several
representative studies for each paradigm.


6.1 Introduction


Graph Neural Networks (GNNs) have gained increasing popularity and achieved
remarkable results in many fields, including social networks (Freeman, 2000;
Perozzi et al, 2014; Hamilton et al, 2017b; Kipf and Welling, 2017b),
bioinformatics (Gilmer et al, 2017; Yang et al, 2019b; Ma et al, 2020a), knowledge
graphs (Liben-Nowell and Kleinberg, 2007; Hamaguchi et al, 2017; Schlichtkrull
et al, 2018), etc. GNN models are powerful in capturing accurate graph structure
information as well as the underlying connections and interactions between nodes (Li
et al, 2016b; Veličković et al, 2018; Xu et al, 2018a, 2019d). Generally, GNN models
are constructed based on the features of the nodes and edges, as well as the
adjacency matrix of the whole graph. However, since graph data is growing rapidly
nowadays, graph sizes are increasing exponentially too. The recently published
Open Graph Benchmark (OGB) collects several commonly
used datasets for machine learning on graphs (Weihua Hu, 2020). Table 6.1 lists the
statistics of the datasets for node classification tasks. As observed, the large-scale
dataset ogbn-papers100M contains over one hundred million nodes and one billion
edges. Even the relatively small dataset ogbn-arxiv still contains a fairly large
number of nodes and edges.


Table 6.1: The statistics of node classification datasets from OGB (Weihua Hu, 
2020). 


Scale    Name              Number of Nodes   Number of Edges
Large    ogbn-papers100M   111,059,956       1,615,685,872
Medium   ogbn-products     2,449,029         61,859,140
Medium   ogbn-proteins     132,534           39,561,252
Medium   ogbn-mag          1,939,743         21,111,007
Small    ogbn-arxiv        169,343           1,166,243


For such large graphs, the original implementation of GNN is not suitable. There
are two main obstacles: 1) large memory requirement, and 2) inefficient gradient
update. First, most GNN models need to store the entire adjacency matrices
and the feature matrices in memory, which demands huge memory consumption.
The available memory may simply not be adequate for very large graphs. Therefore,
GNNs cannot be applied to large graphs directly. Second, during the training
phase of most GNN models, the gradient of each node is updated in every iteration,
which is inefficient and infeasible for large graphs. This scenario is similar to
gradient descent versus stochastic gradient descent: gradient descent may
take too long to converge on a large dataset, so stochastic gradient descent is
introduced to speed up the process towards an optimum.
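
To make the memory obstacle concrete, the following back-of-the-envelope sketch estimates storage for ogbn-papers100M from the node and edge counts in Table 6.1. The feature width of 128, the float32 and int64 element sizes, and the edge-list layout are assumptions chosen for illustration, not figures taken from this chapter.

```python
NUM_NODES = 111_059_956     # ogbn-papers100M, Table 6.1
NUM_EDGES = 1_615_685_872   # ogbn-papers100M, Table 6.1
FEATURE_DIM = 128           # assumed feature width, for illustration only
BYTES_PER_FLOAT = 4         # assumed float32 storage
BYTES_PER_INDEX = 8         # assumed int64 node indices


def gib(num_bytes):
    """Convert a byte count to gibibytes."""
    return num_bytes / 2 ** 30


# Dense n x n adjacency matrix versus an edge list of (source, target) pairs.
dense_adjacency = NUM_NODES ** 2 * BYTES_PER_FLOAT
edge_list = NUM_EDGES * 2 * BYTES_PER_INDEX
features = NUM_NODES * FEATURE_DIM * BYTES_PER_FLOAT

print(f"dense adjacency: {gib(dense_adjacency):,.0f} GiB")
print(f"edge list:       {gib(edge_list):,.1f} GiB")
print(f"node features:   {gib(features):,.1f} GiB")
```

Under these assumptions the dense adjacency matrix alone is on the order of tens of pebibytes, while even the sparse edge list and the input features each take tens of gibibytes, before any per-layer embeddings or gradients are stored.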

In order to tackle these obstacles, recent studies propose to design proper sampling
algorithms on large graphs to reduce the computational cost as well as increase
the scalability. In this chapter, we categorize different sampling methods according
to the underlying algorithms, and introduce typical works accordingly.


6.2 Preliminary 


We first briefly introduce some concepts and notations that are used in this chapter. 
Given a graph $\mathcal{G}(\mathcal{V}, \mathcal{E})$, $\mathcal{V}$ denotes the set of $n = |\mathcal{V}|$ nodes and $\mathcal{E}$ denotes the
set of $m = |\mathcal{E}|$ edges. $\mathcal{N}(v)$ is the neighborhood of node $v \in \mathcal{V}$, i.e., $u \in \mathcal{N}(v)$ if
$(u, v) \in \mathcal{E}$. The vanilla GNN architecture can be summarized as:

$$h^{(l+1)} = \sigma\left(A h^{(l)} W^{(l)}\right),$$

where $A$ is the normalized adjacency matrix, $h^{(l)}$ represents the embedding of the
nodes in the graph at layer/depth $l$, $W^{(l)}$ is the weight matrix of the neural network,
and $\sigma$ denotes the activation function.
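
As a minimal sketch of the layer above in plain Python, the snippet below applies one propagation step on a 3-node path graph. The row normalization of A + I and the ReLU activation are assumptions for illustration (the formula does not fix either choice), and the helper names are hypothetical.

```python
def matmul(a, b):
    """Product of an (n x k) and a (k x m) matrix stored as lists of lists."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def normalize_adjacency(adj):
    """Add self-loops, then divide every row by its sum (assumed normalization)."""
    n = len(adj)
    with_loops = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
                  for i in range(n)]
    return [[x / sum(row) for x in row] for row in with_loops]


def gnn_layer(a_norm, h, w):
    """One layer h_next = sigma(A h W), with ReLU as the activation sigma."""
    pre_activation = matmul(matmul(a_norm, h), w)
    return [[max(0.0, x) for x in row] for row in pre_activation]


# Path graph 0 - 1 - 2 with one input feature per node and two output channels.
adjacency = [[0.0, 1.0, 0.0],
             [1.0, 0.0, 1.0],
             [0.0, 1.0, 0.0]]
h0 = [[1.0], [2.0], [3.0]]
w0 = [[1.0, -1.0]]

h1 = gnn_layer(normalize_adjacency(adjacency), h0, w0)
print(h1)
```

Note that the full matrix product touches every node and every edge at every layer, which is exactly the cost that the sampling methods in this chapter try to avoid.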

In large-scale graph learning, the problem is often formulated as node
classification, where each node $v$ is associated with a label $y$, and the goal is to learn from
the graph and predict the labels of unseen nodes.


6.3 Sampling Paradigms 


Sampling aims at selecting a subset of all the samples to represent
the entire sample distribution. Therefore, a sampling algorithm on large graphs
refers to an approach that uses a partial graph instead of the full graph to address
target problems. In this chapter, we categorize different sampling algorithms into
three major groups: node-wise sampling, layer-wise sampling and graph-wise
sampling.

Node-wise sampling played a dominant role during the early stage of implementing
GCN on large graphs, e.g., Graph SAmple and aggreGatE (GraphSAGE)
(Hamilton et al, 2017b) and Variance Reduction Graph Convolutional
Networks (VR-GCN) (Chen et al, 2018d). Later, layer-wise sampling algorithms
were proposed to address the neighborhood expansion problem that occurs during
node-wise sampling, e.g., Fast Learning Graph Convolutional Networks (FastGCN)
(Chen et al, 2018c) and Adaptive Sampling Graph Convolutional Networks
(ASGCN) (Huang et al, 2018). Moreover, graph-wise sampling paradigms were designed
to further improve the efficiency and scalability, e.g., Cluster Graph Convolutional
Networks (Cluster-GCN) (Chiang et al, 2019) and Graph SAmpling based
INductive learning meThod (GraphSAINT) (Zeng et al, 2020a). Fig. 6.1 illustrates
a comparison between the three sampling paradigms. In node-wise sampling, the
nodes are sampled based on the target node in the graph. In layer-wise
sampling, the nodes are sampled based on the convolutional layers in the GNN
models. In graph-wise sampling, sub-graphs are sampled from the original
graph and used for the model inference.


102 Hehuan Ma, Yu Rong, and Junzhou Huang


[Figure 6.1 panels: (a) Node-wise. (b) Layer-wise. (c) Graph-wise.]

Fig. 6.1: Three sampling paradigms toward large-scale GNNs.

According to these paradigms, two main issues should be addressed when constructing
large-scale GNNs: 1) how to design efficient sampling algorithms, and 2)
how to guarantee the sampling quality. In recent years, many works have studied
how to construct large-scale GNNs and how to address the above issues properly.
Fig. 6.2 displays a timeline of representative works in this area from
2017 onward. Each work is introduced in this chapter.


[Figure 6.2 timeline, 2017 to 2020. Node-wise: GraphSAGE, VR-GCN. Layer-wise: FastGCN, ASGCN. Graph-wise: Cluster-GCN, GraphSAINT.]

Fig. 6.2: Timeline of leading research work toward large-scale GNNs.


Other than these major sampling paradigms, more recent works have attempted
to improve scalability on large graphs from various perspectives as well. For
example, heterogeneous graphs have attracted more and more attention with
the rapid growth of data. Large graphs not only include millions of nodes but
also various data types. How to train GNNs on such large graphs has become a new
domain of interest. Li et al (2019a) propose a GCN-based Anti-Spam (GAS) model
to detect spam by considering both homogeneous and heterogeneous graphs. Zhang
et al (2019b) design a random walk sampling method based on all types of nodes.
Hu et al (2020e) employ the transformer architecture to learn the mutual attention
between nodes, and sample the nodes according to different node types.


6.3.1 Node-wise Sampling 


Rather than using all the nodes in the graph, the first approach selects certain nodes
through various sampling algorithms to construct large-scale GNNs. GraphSAGE (Hamilton
et al, 2017b) and VR-GCN (Chen et al, 2018d) are two pivotal studies that utilize
such a method.


6.3.1.1 GraphSAGE 


At the early stage of GNN development, most works targeted transductive learning
on a fixed-size graph (Kipf and Welling, 2017b, 2016), while the inductive
setting is more practical in many cases. Yang et al (2016b) develop an inductive
learning method on graph embeddings, and GraphSAGE (Hamilton et al, 2017b) extends the
study to large graphs. The overall architecture is illustrated in Fig. 6.3.


[Figure 6.3 panels: 1. Sample neighborhood. 2. Aggregate feature information from neighbors. 3. Predict graph context and label using aggregated information.]


Fig. 6.3: Overview of the GraphSAGE architecture. Step 1: sample the neighbor- 
hoods of the target node; step 2: aggregate feature information from the neighbors; 
step 3: utilize the aggregated information to predict the graph context or label. Fig- 
ure excerpted from (Hamilton et al, 2017b). 


GraphSAGE can be viewed as an extension of the original Graph Convolutional
Network (GCN) (Kipf and Welling, 2017b). The first extension is the generalized
aggregator function. Given $\mathcal{G}(\mathcal{V}, \mathcal{E})$, let $\mathcal{N}(v)$ be the neighborhood of $v$ and $h$ the
representation of a node. The embedding generation at the current $(l+1)$-th depth from
the target node $v \in \mathcal{V}$ can be formulated as,

$$h^{(l+1)}_{\mathcal{N}(v)} = \text{AGGREGATE}_l\left(\left\{h^{(l)}_u, \forall u \in \mathcal{N}(v)\right\}\right).$$

Different from the original mean aggregator in GCN, GraphSAGE proposes an LSTM
aggregator and a Pooling aggregator to aggregate the information from the neighbors.
The second extension is that GraphSAGE applies the concatenation function
to combine the information of the target node and its neighborhood instead of the
summation function:

$$h^{(l+1)}_v = \sigma\left(W^{(l+1)} \cdot \text{CONCAT}\left(h^{(l)}_v, h^{(l+1)}_{\mathcal{N}(v)}\right)\right),$$

where $W^{(l+1)}$ are the weight matrices, and $\sigma$ is the activation function.
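
The two extensions can be sketched in a few lines of plain Python. This is an illustrative sketch rather than the reference GraphSAGE implementation: the mean aggregator stands in for AGGREGATE, ReLU is assumed for the activation, every node is assumed to have at least one neighbor, and the weights are fixed by hand instead of learned.

```python
def mean_aggregate(neighbor_embeddings):
    """AGGREGATE: element-wise mean of the neighbors' embeddings."""
    count = len(neighbor_embeddings)
    dim = len(neighbor_embeddings[0])
    return [sum(h[d] for h in neighbor_embeddings) / count for d in range(dim)]


def sage_layer(h, neighbors, weight, aggregate=mean_aggregate):
    """Compute ReLU(W . CONCAT(h_v, h_N(v))) for every node v."""
    new_h = {}
    for v, h_v in h.items():
        h_neigh = aggregate([h[u] for u in neighbors[v]])
        concat = h_v + h_neigh  # list concatenation plays the role of CONCAT
        new_h[v] = [max(0.0, sum(w * x for w, x in zip(row, concat)))
                    for row in weight]
    return new_h


# Toy graph: node 0 is linked to nodes 1 and 2.
embeddings = {0: [1.0, 0.0], 1: [0.0, 2.0], 2: [4.0, 4.0]}
neighbors = {0: [1, 2], 1: [0], 2: [0]}
weight = [[1.0, 0.0, 1.0, 0.0],    # output channel 0
          [0.0, 1.0, 0.0, -1.0]]   # output channel 1

updated = sage_layer(embeddings, neighbors, weight)
print(updated)
```

Because the aggregator is passed in as a function, swapping the mean for another permutation-invariant aggregator only changes the `aggregate` argument; the concatenation keeps the node's own embedding separate from the neighborhood summary.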

In order to make GNNs suitable for large-scale graphs, GraphSAGE introduces
the mini-batch training strategy to reduce the computation cost during the
training phase. Specifically, in each training iteration, only the nodes that are needed
to compute the representations of the nodes in the batch are considered, which
significantly reduces the number of sampled nodes. Take layer 2 in Fig. 6.4(a) as an example:
unlike full-batch training, which takes all 11 nodes into consideration, only 6
nodes are involved in mini-batch training. However, a simple implementation of the
mini-batch training strategy suffers from the neighborhood expansion problem. As shown
in layer 1 of Fig. 6.4(a), most of the nodes are sampled, since the number of sampled
nodes grows exponentially if all the neighbors are sampled at each layer. Thus, all
the nodes are selected eventually if the model contains many layers.


[Figure 6.4 panels: (a) Mini-batch training, batch size = 1. (b) Fixed-size neighbor sampling, S = 2. Legend: sampled node; sampling candidates.]

Fig. 6.4: Visual comparison between mini-batch training and fixed-size neighbor
sampling.


To further improve the training efficiency and mitigate the neighborhood expansion
problem, GraphSAGE adopts a fixed-size neighbor sampling strategy. Specifically,
a fixed-size set of neighbor nodes is sampled at each layer for computing, instead
of using the entire neighborhood set. For example, one can set the fixed size to
two nodes, as illustrated in Fig. 6.4(b), where the yellow nodes represent the sampled
nodes and the blue nodes are the candidate nodes. It is observed that the number of
sampled nodes is significantly reduced, especially for layer 1.
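
The effect of the fixed-size strategy on the receptive field can be illustrated with a small sketch. The tree-shaped toy graph, the function names, and the random seed below are assumptions for illustration; neighbors are sampled uniformly without replacement, which is one simple choice.

```python
import random


def sample_neighbors(neighbors, v, size, rng):
    """Return at most `size` neighbors of v, sampled uniformly without replacement."""
    candidates = neighbors[v]
    if len(candidates) <= size:
        return list(candidates)
    return rng.sample(candidates, size)


def receptive_field(neighbors, targets, num_layers, size=None, seed=0):
    """Node sets needed per layer, listed from the output layer down to the input.

    With size=None every neighbor is used (plain mini-batch training);
    otherwise at most `size` neighbors are sampled per node and layer.
    """
    rng = random.Random(seed)
    layers = [set(targets)]
    for _ in range(num_layers):
        needed = set(layers[-1])  # each node also needs its own previous embedding
        for v in sorted(layers[-1]):
            if size is None:
                needed.update(neighbors[v])
            else:
                needed.update(sample_neighbors(neighbors, v, size, rng))
        layers.append(needed)
    return layers


# Toy tree: node 0 has 4 children, and each child has 4 children of its own.
neighbors = {0: [1, 2, 3, 4]}
next_id = 5
for child in [1, 2, 3, 4]:
    grandchildren = list(range(next_id, next_id + 4))
    neighbors[child] = [0] + grandchildren
    for g in grandchildren:
        neighbors[g] = [child]
    next_id += 4

full = receptive_field(neighbors, targets=[0], num_layers=2)
sampled = receptive_field(neighbors, targets=[0], num_layers=2, size=2)
print([len(s) for s in full])     # all neighbors at each layer: [1, 5, 21]
print([len(s) for s in sampled])  # at most 2 sampled neighbors per node
```

With two layers, using all neighbors pulls in the whole 21-node graph for a single target node, while sampling two neighbors per node keeps the deepest set to at most 9 nodes in this toy graph. The growth is still multiplicative in the number of layers, only with a smaller base.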

In summary, GraphSAGE is the first to consider inductive representation learning
on large graphs. It introduces a generalized aggregator, mini-batch training,
and a fixed-size neighbor sampling algorithm to accelerate the training process. However,
the fixed-size neighbor sampling strategy cannot totally avoid the neighborhood
expansion problem. Also, there are no theoretical guarantees for the sampling quality.


6.3.1.2 VR-GCN 


In order to further reduce the number of sampled nodes, as well as to conduct a
comprehensive theoretical analysis, VR-GCN (Chen et al, 2018d) proposes a Control
Variate Based Estimator. It samples only an arbitrarily small number of neighbor
nodes by employing the historical activations of the nodes. Fig. 6.5 compares the
receptive field of one target node under different sampling strategies. In the implementation
of the original GCN (Kipf and Welling, 2017b), the number of sampled nodes
increases exponentially with the number of layers. With neighbor sampling, the size
of the receptive field is reduced randomly according to the preset sampling number.
Compared with them, VR-GCN utilizes the historical node activations as a control
variate to keep the receptive field small.


Fig. 6.5: Illustration of the receptive field of a single node utilizing different sampling
strategies with a two-layer graph convolutional neural network. The red circle
represents the latest activation, and the blue circle indicates the historical activation.
Figure excerpted from (Chen et al, 2018d).


The neighbor sampling (NS) algorithm proposed by GraphSAGE (Hamilton et al,
2017b) can be formulated as:

$$\text{NS}^{(l)}_v := R \sum_{u \in \hat{\mathcal{N}}^{(l)}(v)} A_{vu} h^{(l)}_u, \quad R = |\mathcal{N}(v)| / d^{(l)},$$

where $\mathcal{N}(v)$ represents the neighbor set of node $v$, $d^{(l)}$ is the sampled size of the
neighbor nodes at layer $l$, $\hat{\mathcal{N}}^{(l)}(v) \subset \mathcal{N}(v)$ is the sampled neighbor set of node $v$ at
layer $l$, and $A$ represents the normalized adjacency matrix. Such a method has been
proved to be a biased sampling, and would cause larger variance. The detailed proof
can be found in (Chen et al, 2018d). As a result, a larger sampled set
$\hat{\mathcal{N}}^{(l)}(v) \subset \mathcal{N}(v)$ is required.

To address these issues, VR-GCN proposes the Control Variate Based Estimator
(CV Sampler), which maintains the historical hidden embedding $\bar{h}^{(l)}_v$ of every
participating node. The difference between $\bar{h}^{(l)}_v$ and $h^{(l)}_v$ should
be small if the model weights do not change too fast, which gives a better estimation.
The CV Sampler is thus capable of reducing the variance and eventually working with
a smaller sample size $\hat{\mathcal{N}}^{(l)}(v)$. Thus,
the feed-forward layer of VR-GCN can be defined as,

$$H^{(l+1)} = \sigma\left(\left(\hat{A}^{(l)}\left(H^{(l)} - \bar{H}^{(l)}\right) + A \bar{H}^{(l)}\right) W^{(l)}\right).$$

$\hat{A}^{(l)}$ is the sampled normalized adjacency matrix at layer $l$, $\bar{H}^{(l)} = \{\bar{h}^{(l)}_1, \cdots, \bar{h}^{(l)}_n\}$
is the stack of the historical hidden embeddings $\bar{h}^{(l)}$, $H^{(l+1)} = \{h^{(l+1)}_1, \cdots, h^{(l+1)}_n\}$ is
the embedding of the graph nodes in the $(l+1)$-th layer, and $W^{(l)}$ is the learnable
weight matrix. In such a manner, the sampled size of $\hat{A}^{(l)}$ is greatly reduced compared
with GraphSAGE by utilizing the historical hidden embeddings $\bar{h}^{(l)}$, which
yields a more efficient computing method. Moreover, VR-GCN also studies
how to apply the Control Variate Estimator to models with dropout. More details can
be found in the paper.
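
For a single node v with scalar embeddings, the NS estimator and the control variate estimator can be contrasted as below. This is a sketch under simplifying assumptions (one node, scalar embeddings, a hand-picked sampled set, and illustrative function names); it is not the VR-GCN implementation.

```python
def exact_aggregate(a_row, h):
    """Sum of A_vu * h_u over all neighbors u of v."""
    return sum(a_vu * h[u] for u, a_vu in a_row.items())


def ns_estimate(a_row, h, sampled):
    """Neighbor sampling: rescale the sum over the sampled neighbors by R."""
    scale = len(a_row) / len(sampled)  # R = |N(v)| / d
    return scale * sum(a_row[u] * h[u] for u in sampled)


def cv_estimate(a_row, h, h_hist, sampled):
    """Control variate: sample only the difference to the historical embeddings."""
    scale = len(a_row) / len(sampled)
    sampled_delta = scale * sum(a_row[u] * (h[u] - h_hist[u]) for u in sampled)
    historical = sum(a_vu * h_hist[u] for u, a_vu in a_row.items())
    return sampled_delta + historical


# Node v has 4 neighbors with equal normalized weights A_vu = 0.25.
a_row = {1: 0.25, 2: 0.25, 3: 0.25, 4: 0.25}
h = {1: 1.0, 2: 2.0, 3: 3.0, 4: 6.0}        # current embeddings
h_hist = {1: 1.1, 2: 1.9, 3: 3.0, 4: 5.8}   # slightly stale historical copies
sampled = [1, 2]                             # d = 2 sampled neighbors

exact = exact_aggregate(a_row, h)
ns_error = abs(ns_estimate(a_row, h, sampled) - exact)
cv_error = abs(cv_estimate(a_row, h, h_hist, sampled) - exact)
print(exact, ns_error, cv_error)
```

In this toy case the control variate estimate is much closer to the exact aggregate because the historical values are close to the current ones; the price is that `h_hist` must be stored for every node, which is the memory trade-off of the method.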

In summary, VR-GCN is the first to analyze variance reduction for node-wise sampling,
and it successfully reduces the size of the samples. However, the trade-off is
that the additional memory consumption for storing the historical hidden embeddings
can be very large. Recall that the limitation of applying GNNs to large-scale
graphs is that it is not realistic to store the full adjacency matrices or the feature
matrices. In VR-GCN, storing the historical hidden embeddings actually increases the
memory cost, which does not help from this perspective.


6.3.2 Layer-wise Sampling 


Since node-wise sampling can only alleviate but not completely solve the neigh- 
borhood expansion problem, layer-wise sampling has been studied to address this 
obstacle. 


6.3.2.1 FastGCN 


In order to solve the neighborhood expansion problem, FastGCN (Chen et al, 2018c)
first proposes to understand the GNN from the functional generalization perspective.
The authors point out that training algorithms such as stochastic gradient descent are
implemented according to the additivity of the loss function for independent data
samples. However, GNN models generally lack sample loss independence. To solve
this problem, FastGCN converts the common graph convolution view to an integral
transform view by introducing a probability measure for each node. Fig. 6.6 shows
the conversion between the traditional graph convolution view and the integral
transform view. In the graph convolution view, a fixed number of nodes are sampled in
a bootstrapping manner in each layer, and two nodes are connected if a connection
exists between them. Each convolutional layer is responsible for integrating the node
embeddings. The integral transform view is visualized according to the probability measure, and
the integral transform (demonstrated in the yellow triangle form) is used to calculate
the embedding function in the next layer. More details can be found in (Chen et al,
2018c).


[Figure 6.6 panels: Graph convolution view; Integral transform view.]


Fig. 6.6: Two views of GCN. The circles represent the nodes in the graph, while 
the yellow circles indicate the sampled nodes. The lines represent the connection 
between nodes. 


Formally, given a graph $\mathcal{G}(\mathcal{V}, \mathcal{E})$, an inductive graph $\mathcal{G}'$ with respect to a probability
space $(\mathcal{V}', F, p)$ is constructed. Specifically, $\mathcal{V}'$ denotes the sample space of
nodes, which are iid samples. The probability measure $p$ defines a sampling distribution,
and $F$ can be any event space, e.g., $F = 2^{\mathcal{V}'}$. Take nodes $v$ and $u$ with the same
probability measure $p$, $g(h^{(K)}(v))$ as the loss on the final embedding of node
$v$, and $\mathbb{E}$ as the expectation function. The functional generalization is formulated as,

$$L = \mathbb{E}_{v \sim p}\left[g\left(h^{(K)}(v)\right)\right] = \int g\left(h^{(K)}(v)\right) dp(v).$$

Moreover, consider sampling $t_l$ iid samples $u^{(l)}_1, \dots, u^{(l)}_{t_l} \sim p$ for each layer $l$,
$l = 0, \dots, K-1$. A layer-wise estimation of the loss function is admitted as,

$$L_{t_0, t_1, \cdots, t_K} := \frac{1}{t_K} \sum_{i=1}^{t_K} g\left(h^{(K)}_{t_K}\left(u^{(K)}_i\right)\right),$$

which shows that FastGCN samples a fixed number of nodes at each layer.

Furthermore, in order to reduce the sampling variance, FastGCN adopts
importance sampling with respect to the weights in the normalized adjacency matrix:

$$q(u) = \|A(:, u)\|^2 \Big/ \sum_{u' \in \mathcal{V}} \|A(:, u')\|^2, \quad u \in \mathcal{V}, \qquad (6.1)$$

where $A$ is the normalized adjacency matrix of the graph. Detailed proofs can be
found in (Chen et al, 2018c). According to Equation 6.1, the entire sampling process
is independent for each layer, and the sampling probability stays the same across layers.
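
Equation 6.1 can be sketched in plain Python as follows. The toy matrix and the helper names are assumptions for illustration, and sampling with replacement via `random.choices` is one simple way to draw the nodes of a layer.

```python
import random


def importance_distribution(a):
    """q(u) proportional to the squared norm of column u of the matrix A."""
    n = len(a)
    col_sq_norms = [sum(a[v][u] ** 2 for v in range(n)) for u in range(n)]
    total = sum(col_sq_norms)
    return [c / total for c in col_sq_norms]


def sample_layer(q, num_samples, rng):
    """Draw node indices for one layer, independently of all other layers."""
    return rng.choices(range(len(q)), weights=q, k=num_samples)


# Toy row-stochastic matrix on 3 nodes (values chosen by hand).
a = [[0.50, 0.50, 0.00],
     [0.25, 0.50, 0.25],
     [0.00, 0.50, 0.50]]

q = importance_distribution(a)
rng = random.Random(0)
layer_samples = [sample_layer(q, num_samples=4, rng=rng) for _ in range(2)]
print(q)
```

Node 1 has the heaviest column and therefore the largest sampling probability. The same distribution `q` is reused for every layer, which is what makes the layers independent of each other.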


Fig. 6.7: Comparison between full GCN and FastGCN.


Compared with GraphSAGE (Hamilton et al, 2017b), FastGCN is much less
computationally costly. Assume $t_l$ neighbor nodes are sampled for layer $l$; the neighborhood
expansion size is at most the sum of the $t_l$'s for FastGCN, while it could be up
to the product of the $t_l$'s for GraphSAGE. Fig. 6.7 illustrates the sampling difference
between full GCN and FastGCN. In full GCN, all nodes and connections are involved, so
it has to compute and update all the gradients, while FastGCN only samples a
fixed number of nodes at each layer. Therefore, the computational cost is greatly
decreased. On the other hand, FastGCN still retains most of the information thanks
to the importance sampling method. The fixed number of nodes are randomly
sampled in each training iteration, thus every node and the corresponding connections
could be selected and fit into the model if the training time is long enough.
Therefore, the information of the entire graph is generally retained.
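
The gap between the two bounds grows quickly with depth, as the following arithmetic sketch shows. The per-layer sample sizes are hypothetical values chosen for illustration.

```python
from math import prod

sample_sizes = [10, 10, 10]  # hypothetical t_l for a 3-layer model

layer_wise_bound = sum(sample_sizes)   # FastGCN: at most the sum of the t_l's
node_wise_bound = prod(sample_sizes)   # GraphSAGE: up to the product of the t_l's
print(layer_wise_bound, node_wise_bound)  # 30 1000
```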

In summary, FastGCN solves the neighborhood expansion problem through
fixed-size layer sampling. Meanwhile, this sampling strategy has a quality guarantee.
However, since FastGCN samples each layer independently, it fails to capture
the between-layer correlations, which leads to a performance compromise.


6.3.2.2 ASGCN 


To better capture the between-layer correlations, ASGCN (Huang et al, 2018) proposes
an adaptive layer-wise sampling strategy. Specifically, the sampling probability
of lower layers depends on the upper ones. As shown in Fig. 6.8(a), ASGCN only
samples nodes from the neighbors of the sampled node (yellow node) to obtain
better between-layer correlations, while FastGCN utilizes importance sampling
among all the nodes.


[Figure 6.8 panels: (a) ASGCN vs. FastGCN. (b) Top-down sampling of ASGCN.]

Fig. 6.8: A demonstration of the sampling strategies used in ASGCN.


Meanwhile, the sampling process of ASGCN is performed in a top-down manner.
As shown in Fig. 6.8(b), the sampling process is first conducted in the output
layer, which is layer 3. Next, the participating nodes of the intermediate layer
are sampled according to the results of the output layer. Such a sampling strategy
captures dense connections between layers.

The sampling probability of lower layers depends on the upper ones. Take
Fig. 6.9 as an illustration: $p(u_j \mid v_i)$ is the probability of sampling node $u_j$ given
node $v_i$, where $v_i$ refers to node $i$ in the $(l+1)$-th layer while $u_j$ denotes node $j$ in the
$l$-th layer; $n'$ represents the number of sampled nodes in every layer while $n$ is the number of
all the nodes in the graph; $q(u_j \mid v_1, \cdots, v_{n'})$ is the probability of sampling $u_j$ given
all the nodes in the current layer; and $\hat{a}(v_i, u_j)$ represents the entry of nodes $v_i$ and
$u_j$ in the re-normalized adjacency matrix $\hat{A}$. The sampling probability $q(u_j)$ can be
written as,

$$q(u_j) = \frac{p(u_j \mid v_i)}{q(u_j \mid v_1, \cdots, v_{n'})}, \quad p(u_j \mid v_i) = \frac{\hat{a}(v_i, u_j)}{N(v_i)}, \quad N(v_i) = \sum_{j=1}^{n} \hat{a}(v_i, u_j).$$

Fig. 6.9: Network construction example: (a) node-wise sampling; (b) layer-wise
sampling; (c) skip connection implementation. Figure excerpted from (Huang et al,
2018).
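
A simplified sketch of top-down, neighbor-restricted sampling is given below. It assumes that p(u_j | v_i) is the adjacency entry for (v_i, u_j) divided by the sum of row v_i, and it averages these conditional distributions over the upper-layer nodes instead of using the learned sampler of ASGCN; the matrix and the function names are hypothetical.

```python
import random


def conditional_probs(a_hat, v):
    """p(u | v): row v of the adjacency matrix, normalized to sum to one."""
    row_sum = sum(a_hat[v])
    return [x / row_sum for x in a_hat[v]]


def lower_layer_distribution(a_hat, upper_nodes):
    """Average p(u | v) over the nodes v already sampled in the upper layer."""
    n = len(a_hat)
    q = [0.0] * n
    for v in upper_nodes:
        for u, p in enumerate(conditional_probs(a_hat, v)):
            q[u] += p / len(upper_nodes)
    return q


# Path graph 0 - 1 - 2 - 3 - 4 with self-loops (unnormalized weights).
a_hat = [[1, 1, 0, 0, 0],
         [1, 1, 1, 0, 0],
         [0, 1, 1, 1, 0],
         [0, 0, 1, 1, 1],
         [0, 0, 0, 1, 1]]

upper_nodes = [0, 1]  # nodes already sampled in layer l + 1
q = lower_layer_distribution(a_hat, upper_nodes)
rng = random.Random(0)
lower_nodes = rng.choices(range(len(q)), weights=q, k=3)  # nodes for layer l
print(q, lower_nodes)
```

Nodes 3 and 4 are not adjacent to any upper-layer node, so they receive zero probability and can never be drawn. This is the between-layer dependence that a layer-independent sampler lacks, and it is also why the layers cannot be sampled in parallel.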


To further reduce the sampling variance, ASGCN introduces explicit variance
reduction, which optimizes the sampling variance as part of the final objective. Consider
$x(u_j)$ as the node feature of node $u_j$; the optimal sampling probability $q^*(u_j)$ can
be formulated as,

$$q^*(u_j) = \frac{\sum_{i=1}^{n'} p(u_j \mid v_i) \, |g(x(u_j))|}{\sum_{j=1}^{n} \sum_{i=1}^{n'} p(u_j \mid v_i) \, |g(x(u_j))|}, \quad g(x(u_j)) = W_g x(u_j). \qquad (6.2)$$

However, simply utilizing the sampler given by Equation 6.2 is not sufficient
to secure a minimal variance. Thus, ASGCN designs a hybrid loss by adding the
variance to the classification loss $\mathcal{L}_c$, as shown in Equation 6.3. In such a manner,
the variance can be trained to achieve the minimal status.

$$\mathcal{L} = \frac{1}{n'} \sum_{i=1}^{n'} \mathcal{L}_c\left(y_i, \bar{y}\left(\hat{\mu}_q(v_i)\right)\right) + \lambda \text{Var}_q\left(\hat{\mu}_q(v_i)\right), \qquad (6.3)$$

where $y_i$ is the ground-truth label, $\hat{\mu}_q(v_i)$ represents the output hidden embeddings
of node $v_i$, and $\bar{y}(\hat{\mu}_q(v_i))$ is the prediction. $\lambda$ is involved as a trade-off parameter.
The variance reduction term $\lambda \text{Var}_q(\hat{\mu}_q(v_i))$ can also be viewed as a regularization
according to the sampled instances.
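
The sampler of Equation 6.2 can be sketched as follows. The snippet assumes the conditional probabilities p(u_j | v_i) are already given as a matrix with one row per upper-layer node, and that W_g maps a feature vector to a scalar; the numbers and names are hypothetical, and training W_g through the hybrid loss is not shown.

```python
def linear_g(w_g, x):
    """g(x(u_j)) = W_g x(u_j), with W_g given as a single row of weights."""
    return sum(w * xi for w, xi in zip(w_g, x))


def variance_reduced_sampler(p, features, w_g):
    """q*(u_j) proportional to sum_i p(u_j | v_i) * |g(x(u_j))|."""
    num_candidates = len(features)
    scores = [sum(row[j] for row in p) * abs(linear_g(w_g, features[j]))
              for j in range(num_candidates)]
    total = sum(scores)
    return [s / total for s in scores]


# Two upper-layer nodes (rows) and three candidate nodes (columns).
p = [[0.5, 0.5, 0.0],
     [0.0, 0.5, 0.5]]
features = [[1.0], [2.0], [-1.0]]
w_g = [1.0]

q_star = variance_reduced_sampler(p, features, w_g)
print(q_star)
```

The middle candidate is favored twice over: it is reachable from both upper-layer nodes and has the largest feature magnitude under this choice of W_g.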

ASGCN also proposes a skip connection method to obtain the information across
distant nodes. As shown in Fig. 6.9 (c), the nodes in the $(l-1)$-th layer theoretically
preserve the second-order proximity (Tang et al, 2015b), as they are the 2-hop neighbors
of the nodes in the $(l+1)$-th layer. The sampled nodes will include both 1-hop
and 2-hop neighbors by adding a skip connection between the $(l-1)$-th layer and the
$(l+1)$-th layer, which captures the information between distant nodes and facilitates
the model training.

In summary, by introducing the adaptive sampling strategy, ASGCN gains
better performance as well as better variance control. However, it also
brings in additional dependence during sampling. Take FastGCN as an example:
it can perform parallel sampling to accelerate the sampling process since each layer
is sampled independently. In ASGCN, however, the sampling process depends on
the upper layer, thus parallel processing is not applicable.


6.3.3 Graph-wise Sampling


Fig. 6.10: An illustration of graph-wise sampling on large-scale graph.


Other than layer-wise sampling, the graph-wise sampling strategy has recently been
introduced to accomplish efficient training on large-scale graphs. As shown in Fig. 6.10,
a whole graph can be sampled into several sub-graphs that are fed into the GNN models,
in order to reduce the computational cost.


6.3.3.1 Cluster-GCN 


Cluster-GCN (Chiang et al, 2019) first proposes to extract small graph clusters based 
on efficient graph clustering algorithms. The intuition is that the mini-batch algo- 
rithm is correlated with the number of links between nodes in one batch. Hence, 
Cluster-GCN constructs mini-batch on the sub-graph level, while previous studies 
usually construct mini-batch based on the nodes. 

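Cluster-GCN extracts small clusters based on the following formulation. A graph
$\mathcal{G}(\mathcal{V},\mathcal{E})$ can be divided into $c$ portions by grouping its
nodes, where $\mathcal{V} = [\mathcal{V}_1, \cdots, \mathcal{V}_c]$. The extracted
sub-graphs can be defined as

$$\bar{\mathcal{G}} = [\mathcal{G}_1, \cdots, \mathcal{G}_c] = [\{\mathcal{V}_1, \mathcal{E}_1\}, \cdots, \{\mathcal{V}_c, \mathcal{E}_c\}].$$

$(\mathcal{V}_t, \mathcal{E}_t)$ represents the nodes and the links within the $t$-th
portion, $t \in \{1, \cdots, c\}$. The re-ordered adjacency matrix can be written as

$$A = \bar{A} + \Delta = \begin{bmatrix} A_{11} & \cdots & A_{1c} \\ \vdots & \ddots & \vdots \\ A_{c1} & \cdots & A_{cc} \end{bmatrix}, \quad
\bar{A} = \begin{bmatrix} A_{11} & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & A_{cc} \end{bmatrix}, \quad
\Delta = \begin{bmatrix} 0 & \cdots & A_{1c} \\ \vdots & \ddots & \vdots \\ A_{c1} & \cdots & 0 \end{bmatrix}.$$
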
Different graph clustering algorithms can be used to partition the graph so that more
links fall between nodes within the same cluster. Treating a sub-graph as a batch also
follows the nature of graphs, in which neighbors usually stay close to each other.
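
The split of the re-ordered adjacency matrix into a block-diagonal part (links within
clusters) and a remainder (links between clusters) can be illustrated with a short
sketch. This is a minimal illustration in plain Python, not the authors' implementation:
the toy graph, the hand-picked partition `clusters`, and the helper name
`split_adjacency` are assumptions made for the example. In practice the partition would
come from a graph clustering algorithm.

```python
def split_adjacency(adj, clusters):
    """Split A into a within-cluster part (A_bar) and a between-cluster part (Delta)."""
    n = len(adj)
    # Map every node to the index of the cluster that contains it.
    cluster_of = {}
    for t, nodes in enumerate(clusters):
        for v in nodes:
            cluster_of[v] = t
    a_bar = [[0] * n for _ in range(n)]
    delta = [[0] * n for _ in range(n)]
    for u in range(n):
        for v in range(n):
            if cluster_of[u] == cluster_of[v]:
                a_bar[u][v] = adj[u][v]   # link kept inside a sub-graph
            else:
                delta[u][v] = adj[u][v]   # link dropped by vanilla sub-graph batching
    return a_bar, delta


# Toy graph (assumption): two triangles joined by the single edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
num_nodes = 6
adj = [[0] * num_nodes for _ in range(num_nodes)]
for u, v in edges:
    adj[u][v] = adj[v][u] = 1

clusters = [[0, 1, 2], [3, 4, 5]]  # hand-picked partition, c = 2
a_bar, delta = split_adjacency(adj, clusters)
```

Here `a_bar` holds the two triangles and `delta` holds only the bridging edge, which is
exactly the information a batch built from a single cluster would not see.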
Fig. 6.11: Comparison between GraphSAGE (fixed-size neighbor sampling, S = 2) and
Cluster-GCN, showing sampled nodes and sampling candidates. Cluster-GCN only samples
the nodes within each sub-graph.


This strategy avoids the neighbor expansion problem since it only samples the nodes
within the clusters, as shown in Fig. 6.11. For Cluster-GCN, since there is no
connection between the sub-graphs, nodes in other sub-graphs are not sampled as the
number of layers increases. In this manner, the sampling process controls neighbor
expansion by sampling over the sub-graphs, whereas in layer-wise sampling the neighbor
expansion control is implemented by fixing the neighbor sampling size.

However, two concerns remain with the vanilla Cluster-GCN. The first is that the links
between sub-graphs are dismissed, which may fail to capture important correlations. The
second is that the clustering algorithm may change the original distribution of the
dataset and introduce some bias. To address these concerns, the authors propose a
stochastic multiple partitions scheme that randomly combines clusters into a batch.
Specifically, the graph is first clustered into p sub-graphs; then, in each training
epoch, a new batch is formed by randomly combining q clusters (q < p), and the links
between the chosen clusters are included too. Fig. 6.12 visualizes an example in which
q equals 2. As observed, the new batch is formed by 2 random clusters, along with the
retained connections between the clusters.
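
The batching step of the stochastic multiple partitions scheme can be sketched as
follows. This is a minimal sketch under stated assumptions rather than the reference
implementation: the toy clusters and edges, and the helper names `merge_clusters` and
`sample_batch`, are hypothetical and chosen only for illustration.

```python
import random


def merge_clusters(clusters, edges, chosen):
    """Build one batch from the chosen clusters, keeping links between them."""
    nodes = set()
    for t in chosen:
        nodes.update(clusters[t])
    # Keep every edge whose two endpoints are both in the batch. This retains
    # within-cluster links and the links between the chosen clusters.
    batch_edges = [(u, v) for (u, v) in edges if u in nodes and v in nodes]
    return nodes, batch_edges


def sample_batch(clusters, edges, q, rng):
    """Randomly pick q of the p clusters and merge them into a batch."""
    chosen = sorted(rng.sample(range(len(clusters)), q))
    nodes, batch_edges = merge_clusters(clusters, edges, chosen)
    return chosen, nodes, batch_edges


# Toy input (assumption): p = 4 clusters arranged in a ring.
clusters = [[0, 1], [2, 3], [4, 5], [6, 7]]
edges = [(0, 1), (2, 3), (4, 5), (6, 7), (1, 2), (3, 4), (5, 6), (7, 0)]

rng = random.Random(0)
chosen, nodes, batch_edges = sample_batch(clusters, edges, q=2, rng=rng)
```

Calling `sample_batch` once per epoch with a fresh draw yields a different combination
of clusters each time, so links that a fixed partition would always drop can still
appear in some batches.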


Fig. 6.12: An illustration of the stochastic multiple partitions scheme (panels:
original graph; Epoch 1).


In summary, Cluster-GCN is a practical solution based on sub-graph batching. It
achieves good performance with good memory usage, and it alleviates the neighborhood
expansion problem of traditional mini-batch training. However, Cluster-GCN does not
analyze the sampling quality, e.g., the bias and variance of this sampling strategy. In
addition, the performance is highly correlated with the clustering algorithm.


6.3.3.2 GraphSAINT 


Instead of using clustering algorithms to generate the sub-graphs, which may bring in
certain bias or noise, GraphSAINT (Zeng et al, 2020a) proposes to directly sample a
sub-graph for mini-batch training with a sub-graph sampler, and to employ a full GCN on
the sub-graph to generate the node embeddings as well as back-propagate the loss for
each node. As shown in Fig. 6.13, sub-graph $\mathcal{G}_s$ is constructed from the
original graph $\mathcal{G}$ with Nodes 0, 1, 2, 3, 4, 7 included. Next, a full GCN is
applied on these 6 nodes along with the corresponding connections.


Fig. 6.13: An illustration of the GraphSAINT training algorithm:
$\mathcal{G}_s = \mathrm{SAMPLE}(\mathcal{G})$, followed by a full GCN on
$\mathcal{G}_s$. The yellow circle indicates the sampled node.


GraphSAINT introduces three sub-graph sampler constructions to form the sub-graphs:
the node sampler, the edge sampler and the random walk sampler (Fig. 6.14). Given a
graph $\mathcal{G}(\mathcal{V},\mathcal{E})$, node $v \in \mathcal{V}$ and edge
$(u,v) \in \mathcal{E}$, the node sampler randomly samples a set of nodes
$\mathcal{V}_s$ from $\mathcal{V}$. The edge sampler selects the sub-graph based on the
probability of edges in the original graph $\mathcal{G}$. The random walk sampler picks
node pairs according to the probability that there exists an $L$-hop path from node $u$
to $v$.

Moreover, GraphSAINT provides a comprehensive theoretical analysis of how to control
the bias and variance of the sampler. First, it proposes loss normalization and
aggregation normalization to eliminate the sampling bias.

Loss normalization:

$$L_{batch} = \sum_{v \in \mathcal{G}_s} L_v / \lambda_v, \qquad \lambda_v = |\mathcal{V}| \, p_v$$

Aggregation normalization:

$$\alpha(u,v) = p_{u,v} / p_v$$

where $p_v$ is the probability of a node $v \in \mathcal{V}$ being sampled, $p_{u,v}$
is the probability of an edge $(u,v) \in \mathcal{E}$ being sampled, and $L_v$
represents the loss of $v$ in the output layer. Second, GraphSAINT also proposes to
minimize the sampling variance by adjusting the edge sampling probability by:

$$p_{u,v} \propto 1/d_u + 1/d_v.$$



Extensive experiments demonstrate the effectiveness and efficiency of GraphSAINT,
showing that it converges fast and achieves superior performance.

In summary, GraphSAINT proposes a highly flexible and extensible framework, covering
both the graph sampler strategies and the GNN architectures, and achieves good
performance in both accuracy and speed.
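
The edge sampling probability and the two normalization terms can be written down in a
few lines. The following is a hedged sketch in plain Python, not the GraphSAINT code
base: the helper names, the toy graph, and the choice to normalise the edge weights so
that they sum to one (i.e., the distribution used to draw a single edge) are
assumptions made for illustration. The node probabilities `p_node` are simply taken as
given.

```python
from collections import defaultdict


def edge_sampling_probs(edges):
    """Weights proportional to 1/d_u + 1/d_v, normalised to sum to one."""
    degree = defaultdict(int)
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    raw = {(u, v): 1.0 / degree[u] + 1.0 / degree[v] for u, v in edges}
    total = sum(raw.values())
    return {edge: weight / total for edge, weight in raw.items()}


def normalized_batch_loss(node_loss, p_node, sampled_nodes, num_nodes):
    """Loss normalisation: sum over sampled v of L_v / lambda_v, lambda_v = |V| * p_v."""
    return sum(node_loss[v] / (num_nodes * p_node[v]) for v in sampled_nodes)


def aggregation_weight(p_edge_uv, p_node_v):
    """Aggregation normalisation: alpha(u, v) = p_{u,v} / p_v."""
    return p_edge_uv / p_node_v


# Toy graph (assumption): node 0 is a leaf attached to node 1; nodes 1, 2, 3 form a triangle.
edges = [(0, 1), (1, 2), (1, 3), (2, 3)]
p_edge = edge_sampling_probs(edges)

# Sanity case: if every node is sampled with probability 1, the normalised
# batch loss reduces to the plain average of the node losses.
node_loss = {0: 1.0, 1: 2.0, 2: 3.0, 3: 6.0}
p_all = {v: 1.0 for v in node_loss}
full_loss = normalized_batch_loss(node_loss, p_all, list(node_loss), num_nodes=4)
```

In this toy graph the edge touching the low-degree leaf receives the largest
probability, which matches the intuition that edges around low-degree nodes should be
sampled more often.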


6.3.3.3 Overall Comparison of Different Models 


Table 6.2 compares and summarizes the characteristics of the previously mentioned
models. Paradigm indicates the sampling paradigm, and Model refers to the method
proposed in each paper. Sampling Strategy shows the sampling theory, and Variance
Reduction denotes whether such analysis is conducted in the paper. Solved Problem
represents the problem that the proposed model addresses, and Characteristics lists
the distinguishing features of the model.


Fig. 6.14: An illustration of different samplers. Node sampler: uniformly sample
nodes. Edge sampler: $p_{u,v} \propto 1/d_u + 1/d_v$, where $d_u$ is the degree of
node $u$. Random walk sampler: a random walk that starts at $u$ and ends at $v$ in $L$
hops.


Table 6.2: The comparison between different models.

| Paradigm | Model | Sampling Strategy | Variance Reduction | Solved Problem | Characteristics |
|---|---|---|---|---|---|
| Node-wise Sampling | GraphSAGE (Hamilton et al, 2017b) | Random | × | [illegible] | Mini-batch training, reduce neighborhood expansion. |
| Node-wise Sampling | VR-GCN (Chen et al, 2018d) | Random | ✓ | [illegible] | [illegible] |
| Layer-wise Sampling | FastGCN (Chen et al, 2018c) | Importance | ✓ | Neighborhood expansion | Integral transform view. |
| Layer-wise Sampling | ASGCN (Huang et al, 2018) | Importance | ✓ | Between-layer correlation | Explicit variance reduction, skip connection. |
| Graph-wise Sampling | Cluster-GCN (Chiang et al, 2019) | Random | ✓ | Graph batching | Mini-batch on sub-graph. |
| Graph-wise Sampling | GraphSAINT (Zeng et al, 2020a) | Edge Probability | ✓ | Neighborhood expansion | Variance and bias control. |


6.4 Applications of Large-scale Graph Neural Networks on 
Recommendation Systems 


Deploying large-scale neural networks in academia has achieved remarkable success.
Beyond the theoretical study of how to scale GNNs to large graphs, another crucial
problem is how to embed the algorithms into industrial applications. One of the most
common applications that demands tremendous amounts of data is the recommendation
system, which learns user preferences and predicts what users may be interested in.
Traditional recommendation algorithms such as collaborative filtering are mainly
designed around user-item interactions (Goldberg et al, 1992; Koren et al, 2009; Koren,
2009; He et al, 2017b). Such methods cannot cope with explosively growing web-scale
data because of its extreme sparsity. Recently, graph-based deep learning algorithms
have significantly improved the prediction performance of recommendation systems by
modeling the graph structures of web-scale data (Zhang et al, 2019b; Shi et al, 2018a;
Wang et al, 2018b). Therefore, utilizing large-scale GNNs for recommendation has become
a trend in industry (Ying et al, 2018b; Zhao et al, 2019b; Wang et al, 2020d; Jin et al,
2020b).

Recommendation systems can typically be categorized into two fields: item-item
recommendation and user-item recommendation. The former aims to find similar items
based on a user’s historical interactions, while the latter directly predicts the
user’s preferred items by learning the user behaviors. In this chapter, we briefly
introduce notable recommendation systems that are implemented on large graphs for each
field.


6.4.1 Item-item Recommendation 


PinSage (Ying et al, 2018b) is one of the early successful applications of large-scale
GNNs to item-item recommendation, and it is deployed on Pinterest¹. Pinterest is a
social media application for sharing and discovering various content. Users mark the
content they are interested in with pins and organize the pins on boards. When users
browse the website, Pinterest recommends potentially interesting content to them. By
the year 2018, the Pinterest graph contained 2 billion pins, 1 billion boards, and over
18 billion edges between pins and boards.

In order to scale the training model to such a large graph, Ying et al (2018b) propose
PinSage, a random-walk-based GCN, to implement node-wise sampling on the Pinterest
graph. Specifically, a short random walk is used to select a fixed-size neighborhood of
the target node. Fig. 6.15 demonstrates the overall architecture of PinSage. Taking
node A as an example, a 2-depth convolution is constructed to generate the node
embedding $h_A^{(2)}$. The embedding vector $h_{\mathcal{N}(A)}^{(1)}$ of node A’s
neighbors is aggregated from nodes B, C, and D. A similar process is applied to obtain
the 1-hop neighbors’ embeddings $h_B^{(1)}$, $h_C^{(1)}$ and $h_D^{(1)}$. An
illustration of all participating nodes for each node from the input graph is shown at
the bottom of Fig. 6.15. In addition, an L1-normalization is computed to rank the
neighbors by their importance (Eksombatchai et al, 2018), and a curriculum training
strategy is used to further improve the prediction performance by feeding
harder-and-harder examples.
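
The idea of selecting a fixed-size neighborhood with short random walks and ranking it
by L1-normalized importance can be sketched as below. This is an illustrative sketch
only, not the production PinSage code: using visit counts as the quantity that is
L1-normalized, the toy adjacency list, and the function name `important_neighbors` are
assumptions made for the example.

```python
import random
from collections import Counter


def important_neighbors(adj_list, target, num_walks, walk_len, top_k, rng):
    """Rank nodes by how often short random walks from `target` visit them."""
    visits = Counter()
    for _ in range(num_walks):
        node = target
        for _ in range(walk_len):
            node = rng.choice(adj_list[node])  # move to a uniformly chosen neighbor
            if node != target:
                visits[node] += 1
    total = sum(visits.values())
    # L1-normalise the visit counts and keep the top_k most visited nodes.
    return [(node, count / total) for node, count in visits.most_common(top_k)]


# Toy graph (assumption): every node has at least one neighbor.
adj_list = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1, 3], 3: [0, 2]}
neighbors = important_neighbors(
    adj_list, target=0, num_walks=200, walk_len=3, top_k=2, rng=random.Random(0)
)
```

The returned weights can then serve as importance scores when aggregating the selected
neighbors, so the neighborhood size stays fixed regardless of the node degree.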

A series of comprehensive experiments conducted on Pinterest data, e.g., offline
experiments, production A/B tests and user studies, have demonstrated the effectiveness
of the proposed method. Moreover, with the adoption of a highly efficient MapReduce
inference pipeline, the entire process on the whole graph can be finished within one
day.


6.4.2 User-item Recommendation 


Unlike item-item recommendation, user-item recommendation is more complex since it
aims at predicting the user’s behaviors. Moreover, there is more auxiliary information
between users and users, items and items, and users and items, which leads to a
heterogeneous graph problem. As shown in Fig. 6.16, the edges between users and between
items have various properties and cannot be treated as one simple relation; e.g., a
user searching for a word and a user visiting a shop should be considered to have
different impacts.


¹ https://www.pinterest.com/


Fig. 6.15: Overview of PinSage architecture. Colored nodes are applied to illustrate
the construction of graph convolutions.


Fig. 6.16: Examples of heterogeneous auxiliary relationships on e-commerce websites.


IntentGC (Zhao et al, 2019b) proposes a GCN-based framework for large-scale user-item
recommendation on e-commerce data. It explores explicit user preferences as well as the
abundant auxiliary information through graph convolutions and makes predictions.
E-commerce platforms such as Amazon contain billions of users and items, while the
diverse relationships bring in more complexity. Thus, the graph structure gets larger
and more complicated. Moreover, due to the sparsity of the user-item graph, sampling
methods like GraphSAGE may result in a very large sub-graph. In order to train graph
convolutions efficiently, IntentGC designs a faster graph convolution mechanism, named
IntentNet, to speed up training.

Fig. 6.17: Comparison between bit-wise and vector-wise graph convolution.


As shown in Fig. 6.17, the bit-wise operation illustrates the traditional way of
constructing node embeddings in a GNN. Specifically, taking node $v$ as the target
node, the embedding vector $h_v^{(l+1)}$ is generated by concatenating the
neighborhood’s embedding $h_{\mathcal{N}(v)}^{(l)}$ and that of the target itself,
$h_v^{(l)}$. Such an operation is able to capture two types of information: the
interactions between the target node and its neighborhood; and the interactions between
different dimensions of the embedding space. However, in user-item networks, learning
the information between different feature dimensions may be less informative and
unnecessary. Therefore, IntentNet designs a vector-wise convolution operation as
follows:


$$g_v^{(l)}(i) = \sigma\left(W_v^{(l)}(i,1) \cdot h_v^{(l)} + W_v^{(l)}(i,2) \cdot h_{\mathcal{N}(v)}^{(l)}\right),$$

$$h_v^{(l+1)} = \sigma\left(\sum_{i=1}^{L} \theta_i^{(l)} \cdot g_v^{(l)}(i)\right),$$

where $W_v^{(l)}(i,1)$ and $W_v^{(l)}(i,2)$ are the associated weight matrices for the
$i$-th local filter. $g_v^{(l)}(i)$ represents the operation that learns the
interactions between the target node and its neighbor nodes in a vector-wise manner.
Another vector-wise layer, with weights $\theta_i^{(l)}$ over the $L$ local filters, is
applied to gather the final embedding vector of the target node for the next
convolutional layer. Moreover, the output vector of the last convolutional layer is fed
into a three-layer fully-connected network to further learn the node-level combinatory
features. Such an operation significantly improves the training efficiency and reduces
the time complexity.
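
To make the vector-wise idea concrete, the sketch below combines the target embedding
and the neighborhood embedding as whole vectors, so that no weight mixes different
feature dimensions. It is a simplified illustration under stated assumptions, not the
IntentGC implementation: each filter weight is taken to be a single scalar, the
activation is assumed to be a sigmoid, and the names `vector_wise_conv`, `filters` and
`theta` are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def vector_wise_conv(h_v, h_nbr, filters, theta):
    """One vector-wise convolution step.

    h_v, h_nbr: embeddings of the target node and of its neighborhood.
    filters:    list of (w1, w2) scalar pairs, one pair per local filter.
    theta:      one scalar per filter, used by the second vector-wise layer.
    """
    # Each local filter scales the two vectors as a whole and adds them,
    # so dimension d of the output depends only on dimension d of the inputs.
    g = [
        [sigmoid(w1 * a + w2 * b) for a, b in zip(h_v, h_nbr)]
        for (w1, w2) in filters
    ]
    # The second layer gathers the filter outputs into the next embedding.
    h_next = [
        sigmoid(sum(t * g_i[d] for t, g_i in zip(theta, g)))
        for d in range(len(h_v))
    ]
    return g, h_next


g, h_next = vector_wise_conv(
    h_v=[0.2, -0.4, 1.0],
    h_nbr=[0.5, 0.1, -0.3],
    filters=[(1.0, 0.5), (-0.5, 1.0)],
    theta=[0.7, 0.3],
)
```

With scalar filter weights, the number of parameters per layer grows with the number of
filters rather than with the square of the embedding size, which is one way to read the
efficiency gain described above.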

Extensive experiments are conducted on Taobao and Amazon datasets, which contain
millions to billions of users and items. IntentGC outperforms other baseline methods
and reduces the training time by about two days compared with GraphSAGE.


6.5 Future Directions 


Overall, in recent years, the scalability of GNNs has been extensively studied and
has achieved fruitful results. Fig. 6.18 summarizes the development towards
large-scale GNNs.


Fig. 6.18: Overall performance comparison of introduced work on large-scale GNNs.


GraphSAGE is the first to propose sampling on the graph instead of computing on the
whole graph. VR-GCN designs another node sampling algorithm and provides a
comprehensive theoretical analysis, but its efficiency is still limited. FastGCN and
ASGCN propose to sample over layers, and both demonstrate their efficiency with
detailed analysis. Cluster-GCN first partitions the graph into sub-graphs to eliminate
the neighborhood expansion problem, and boosts the performance on several benchmarks.
GraphSAINT further improves the graph-wise sampling algorithm to achieve
state-of-the-art classification performance on commonly used benchmark datasets.
Various industrial applications prove the effectiveness and practicability of
large-scale GNNs in the real world.

However, many new open problems arise, e.g., how to balance the trade-off be- 
tween variance and bias during sampling; how to deal with complex graph types 
such as heterogeneous/dynamic graphs; how to properly design models over com- 
plex GNN architectures. Studies toward such directions would improve the devel- 
opment of large-scale GNNs. 


Editor’s Notes: For graphs of large scale or with rapid expansibility, such 
as dynamic graph (chapter 15) and heterogeneous graph (chapter 16), the 
scalability characterization of GNNs is of vital importance to determine 
whether the algorithm is superior in practice. For example, graph sampling 
strategy is especially necessary to ensure computational efficiency in in- 
dustrial scenarios, such as recommender system (chapter 19) and urban in- 
telligence (chapter 27). With the increasing complexity and scale of the 
real problem, the limitation in scalability has been considered almost ev- 
erywhere in the study of GNNs. Researchers devoted to graph embedding 
(chapter 2), graph structure learning (chapter 14) and self-supervised learn- 
ing (chapter 18) put forward very remarkable works to overcome it. 


Chapter 7
Interpretability in Graph Neural Networks


Ninghao Liu and Qizhang Feng and Xia Hu 


Abstract Interpretable machine learning, or explainable artificial intelligence, is ex- 
periencing rapid developments to tackle the opacity issue of deep learning tech- 
niques. In graph analysis, motivated by the effectiveness of deep learning, graph 
neural networks (GNNs) are becoming increasingly popular in modeling graph data. 
Recently, an increasing number of approaches have been proposed to provide ex- 
planations for GNNs or to improve GNN interpretability. In this chapter, we offer 
a comprehensive survey to summarize these approaches. Specifically, in the first 
section, we review the fundamental concepts of interpretability in deep learning. In 
the second section, we introduce the post-hoc explanation methods for understand- 
ing GNN predictions. In the third section, we introduce the advances of developing 
more interpretable models for graph data. In the fourth section, we introduce the 
datasets and metrics for evaluating interpretation. Finally, we point out future direc- 
tions of the topic. 


7.1 Background: Interpretability in Deep Models 


Deep learning has become an indispensable tool for a wide range of applications
such as image processing, natural language processing, and speech recognition. Despite
the success, deep models have been criticized as “black boxes” due to their complexity
in processing information and making decisions. In this section, we introduce the
research background of interpretability in deep models, including the definition of
interpretability/interpretation, the reasons for exploring model interpretation, the
methods of obtaining interpretation in traditional deep models, and the opportunities
and challenges to achieve interpretability in GNN models.


7.1.1 Definition of Interpretability and Interpretation 


There is no unified mathematical definition of interpretability. A commonly used 
(nonmathematical) definition of interpretability is given below (Miller, 2019). 


Definition 7.1. Interpretability is the degree to which an observer can understand 
the cause of a decision. 


There are three elements in the above definition: “understand”, “cause”, and “a 
decision”. According to different scenarios, it is common that these elements are 
re-weighted or even some elements are replaced. First, in the context of machine 
learning systems where the role of humans needs to be emphasized, the definition 
of interpretability is usually revised to adapt to humans (Kim et al, 2016), where in- 
terpretation results that better facilitate human understanding and reasoning habits 
are more desirable. Second, from the term “cause” in the definition, it is natural to 
think that interpretation studies causality properties in models. While causality is 
important in developing certain types of interpretation methods, it is also common 
that interpretation is obtained beyond the framework of causal theories. Third, there 
is an increasing number of techniques that jump out of the scheme of explaining “a 
decision”, and try to understand a broader range of entities such as model compo- 
nents (Olah et al, 2018) and data representations. 

The interpretation is one mode in which an observer may obtain an understand- 
ing of a model or its predictions. A general and widely followed definition is as 
below (Montavon et al, 2018). 


Definition 7.2. An interpretation is the mapping of an abstract concept into a do- 
main that the human can understand. 


Typical examples of human-understandable domains include arrays of pixels in im- 
ages or words in texts. There are two elements that merit attention in the above defi- 
nition: “concept” and “understand”. First, the “concept” to be explained could refer 
to different aspects, such as a predicted class (i.e., the logit value of the predicted 
class), the perception of a model component, or the meaning of a latent dimension. 
Second, in specific scenarios where user experience is important, it is necessary to
convert raw interpretation into a format that facilitates user comprehension,
sometimes even at the cost of interpretation accuracy.

It is also worth noting that, in this work, we distinguish between “interpretation”
and “explanation”. Although their differences have not been formally defined, in the
literature, explanation mainly refers to the collection of important features for a
given prediction (e.g., classification or regression) (Montavon et al, 2018).
Meanwhile, “explanation” is more likely to be used if we are studying post-hoc
interpretation or human-understandable interpretation. “Interpretation” usually refers
to a broader range of concepts, especially to emphasize that the model itself is
intrinsically interpretable (i.e., the transparency of the model).


Fig. 7.1: Left: Interpretation could benefit user experiences in interaction with
models (end users). Right: Through interpretation, we could identify model behaviors
that are not desirable according to humans, and work on improving the model accordingly
(developers) (Ribeiro et al, 2016).


7.1.2 The Value of Interpretation 


There are several pragmatic reasons that motivate people to study and improve 
model interpretability. Depending on who finally benefits from interpretation, we 
divide the reasons into model-oriented and user-oriented, as shown in Fig. 7.1. 


7.1.2.1 Model-Oriented Reasons 


Interpretation is an effective tool to diagnose defects in models and provide
directions for improvement. Therefore, after several iterations of model updates, it is
possible to obtain better models with particular desirable properties, and we could
apply these models to our advantage. Several properties that have been considered in
the literature are summarized below.


1. Credibility: A model is regarded as credible if the rationale behind its predictions
is consistent with well-established domain knowledge. Through interpretation, we could
observe whether the predictions are based on proper evidence, or whether they simply
exploit artifacts in data. By extracting explanations from a model and making the
explanations match human-annotated evidence in data, we are able to improve the
model’s credibility when making decisions (Du et al, 2019; Wang et al, 2018c).

2. Fairness: Machine learning systems have the risk of amplifying societal stereotypes
if they rely on sensitive attributes, such as race, gender and age, in making
predictions. Through interpretation, we could observe whether the predictions are based
on sensitive features that are required to be avoided in real applications.

3. Adversarial-Attack Robustness: Adversarial attack refers to adding carefully- 
crafted perturbations to input, where the perturbations are almost imperceptible 
to humans, but can cause the model to make wrong predictions (Goodfellow 
et al, 2015). Robustness against adversarial attacks is an increasingly impor- 
tant topic in machine learning security. Recent studies have shown how inter- 
pretation could help in discovering new attack schemes and designing defense 
strategies (Liu et al, 2020d). 

4. Backdoor-Attack Robustness: Backdoor attack refers to injecting malicious 
functionality into a model, by either implanting additional modules or poison- 
ing training data. The model will behave normally unless it is fed with input 
containing patterns that trigger the malicious functionality. Studying model ro- 
bustness against backdoor attacks is attracting more interest recently. Recent 
research discovers that interpretation could be applied in identifying if a model 
has been infected by backdoors (Huang et al, 2019c; Tang et al, 2020a). 


7.1.2.2 User-Oriented Reasons 


The interpretation could contribute to the construction of interfaces between humans 
and machines. 


1. Improving User Experiences: By providing intuitive visual information, inter- 
pretation could gain user trust, and increase a system’s ease of use. For example, 
in healthcare-related applications, if the model could explain to patients how it 
makes diagnoses, the patients would be more convinced (Ahmad et al, 2018). 
For another example, in a recommender system, providing explanations can 
help users to make faster decisions and persuade users to purchase the recom- 
mended products (Li et al, 2020c). 

2. Facilitating Decision Making: In many applications, a model plays the role of an
assistant, while humans make the final decision. In this case, interpretation helps
shape human understanding of instances, thus affecting subsequent decision-making
processes. For example, in outlier detection, some outliers have malicious properties
and should be handled with caution, while some are benign instances that simply happen
to be “different”. With interpretation, it is much easier for human decision-makers to
understand whether a given outlier is malicious or benign.


7.1.3 Traditional Interpretation Methods 


In general, there are two categories of techniques for improving model
interpretability. Some efforts have been devoted to building more transparent models,
so that we are able to grasp how the models (or parts of the models) work. We call this
direction interpretable modeling. Meanwhile, instead of elucidating the internal
mechanisms by which models work, some methods investigate post-hoc interpretation to
provide explanations for models that are already built. In this part, we introduce the
techniques of the two categories. Some of the methods provide motivation for GNN
interpretation, which will be introduced in later sections.


Fig. 7.2: Illustration of post-hoc interpretation methods. (a): Local approximation
based interpretation. (b): Layer-wise relevance propagation. (c): Explanation based
on perturbation. (d): Explaining the meaning of latent representation dimensions.
(e): Explaining the meaning of neurons in a convolutional neural network via input
generation.


7.1.3.1 Post-Hoc Interpretation 


Post-hoc interpretation has received a lot of interest in both research and real
applications. Flexibility is one of its advantages, as it places fewer requirements on
the model types or structures. In the following paragraphs, we briefly introduce
several commonly used methods. The basic idea behind each of these methods is
illustrated in Fig. 7.2.

The first type of methods to be introduced is approximation-based methods. Given a
function $f$ that is complex to understand and an input instance
$x^* \in \mathbb{R}^m$, we could approximate $f$ with a simple and understandable
surrogate function $h$ (usually chosen as a linear function) locally around $x^*$.
Here $m$ is the number of features in each instance. There are several ways to build
$h$. A straightforward way is based on the first-order Taylor expansion:

$$f(x) \approx h(x) = f(x^*) + w^\top \cdot (x - x^*), \tag{7.1}$$

where $w \in \mathbb{R}^m$ tells how sensitive the output is to the input features.
Typically, $w$ can be estimated with the gradient (Simonyan et al, 2013), so that
$w = \nabla_x f(x^*)$. When gradient information is not available, such as in
tree-based models, we could build $h$ through training (Ribeiro et al, 2016). The
general idea is that a number of training instances $(x^i, f(x^i))$, $1 \le i \le n$,
are sampled around $x^*$, i.e., $\|x^i - x^*\| \le \epsilon$. The instances are then
used to train $h$, so that $h$ approximates $f$ around $x^*$.
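
The gradient-based variant of the local surrogate can be illustrated with a small
sketch. It is only a toy example under stated assumptions: the function `f` is an
arbitrary stand-in for a complex model, and central finite differences are used in
place of automatic differentiation. The helper names `numerical_gradient` and
`local_surrogate` are hypothetical.

```python
def numerical_gradient(f, x, step=1e-5):
    """Estimate the gradient of f at x with central finite differences."""
    grad = []
    for i in range(len(x)):
        up = list(x)
        down = list(x)
        up[i] += step
        down[i] -= step
        grad.append((f(up) - f(down)) / (2.0 * step))
    return grad


def local_surrogate(f, x_star):
    """Return the sensitivity vector w and the linear surrogate h around x_star."""
    w = numerical_gradient(f, x_star)
    f_star = f(x_star)

    def h(x):
        # First-order Taylor expansion: h(x) = f(x*) + w . (x - x*)
        return f_star + sum(wi * (xi - si) for wi, xi, si in zip(w, x, x_star))

    return w, h


def f(x):
    # Stand-in for a complex model (assumption): quadratic in x[0], linear in x[1].
    return x[0] ** 2 + 3.0 * x[1]


w, h = local_surrogate(f, [1.0, 2.0])
```

At the point (1, 2) the sensitivities are approximately 2 and 3, and `h` agrees with
`f` at that point while drifting away from it as the input moves further from
`x_star`, which is why the approximation is described as local.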

Besides directly studying the sensitivity between input and output, there is an- 
other type of method called layer-wise relevance propagation (LRP) (Bach et al, 
2015). Specifically, LRP redistributes the activation score of output neuron to its 
predecessor neurons, which iterates until reaching the input neurons. The redistri- 
bution of scores is based on the connection weights between neurons in adjacent 
layers. The share received by each input neuron is used as its contribution to the 
output. 
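
The redistribution step can be sketched for a tiny two-layer ReLU network. The text
above only states that scores are redistributed according to connection weights; the
specific rule used here (each predecessor receives a share proportional to its
activation times the connecting weight, with a small stabiliser `eps`) is one common
choice and is an assumption of this sketch, as are the bias-free network, the helper
names and the toy weights.

```python
def stabilise(z, eps):
    """Push z slightly away from zero to avoid division by zero."""
    return z + eps if z >= 0 else z - eps


def lrp_two_layer(x, w1, w2, eps=1e-9):
    """Relevance propagation through input -> ReLU hidden layer -> scalar output."""
    n_in, n_hidden = len(x), len(w2)
    # Forward pass (no bias terms, by assumption).
    z = [sum(x[i] * w1[i][j] for i in range(n_in)) for j in range(n_hidden)]
    a = [max(0.0, zj) for zj in z]
    out = sum(a[j] * w2[j] for j in range(n_hidden))
    # Output -> hidden: each hidden neuron gets its share of the output score.
    r_hidden = [a[j] * w2[j] / stabilise(out, eps) * out for j in range(n_hidden)]
    # Hidden -> input: each input gets its share of every hidden neuron's relevance.
    r_input = [
        sum(x[i] * w1[i][j] / stabilise(z[j], eps) * r_hidden[j] for j in range(n_hidden))
        for i in range(n_in)
    ]
    return out, r_hidden, r_input


x = [1.0, 2.0]
w1 = [[1.0, -1.0], [0.5, 1.0]]   # w1[i][j]: weight from input i to hidden j
w2 = [2.0, -1.0]                 # w2[j]: weight from hidden j to the output
out, r_hidden, r_input = lrp_two_layer(x, w1, w2)
```

In this bias-free setting the input relevances sum (up to the stabiliser) to the output
score, so the contributions can be read as a decomposition of the prediction.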

Another way to understand the importance of a feature x; is to answer questions 
like “What would have happened to f, had x; not existed in input?” If x; is important 
for predicting f(x), then removing/weakening it will cause a significant drop in 
prediction confidence. This type of method is called the perturbation method (Fong 
and Vedaldi, 2017). One of the key challenges in designing perturbation methods is 
how to guarantee the input after perturbation is still valid. For example, it is argued 
that perturbation on word embedding vectors cannot explain deep language models, 
because texts are discrete symbols, and it is hard to identify the meaning of perturbed 
embeddings. 
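
A minimal version of the perturbation idea replaces one feature at a time with a
baseline value and records the change in the model output. The sketch below is
illustrative only: the baseline of zero, the linear stand-in model and the function
name `occlusion_importance` are assumptions, and choosing a baseline that keeps the
perturbed input valid is exactly the difficulty noted above.

```python
def occlusion_importance(f, x, baseline=0.0):
    """Score each feature by the drop in f when that feature is replaced by a baseline."""
    base_score = f(x)
    scores = []
    for i in range(len(x)):
        perturbed = list(x)
        perturbed[i] = baseline          # "remove" feature i
        scores.append(base_score - f(perturbed))
    return scores


def model(x):
    # Stand-in for a trained model (assumption): a fixed linear scorer.
    return 2.0 * x[0] - 1.0 * x[1] + 0.5 * x[2]


scores = occlusion_importance(model, [1.0, 2.0, 4.0])
```

A positive score means that removing the feature lowers the output, and a negative
score means the feature was pushing the output down.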

Different from the previous methods that focus on explaining prediction results, 
there is another type of method that tries to understand how data is represented in- 
side a model. We call it representation interpretation. There is no unified definition 
for representation interpretation. The design of methods under this category is usu- 
ally motivated by the nature of the problem or the properties of data. For example, 
in natural language processing, it has been shown that a word embedding could be 
understood as the composition of a number of basis word embeddings, where the 
basis words constitute a dictionary (Mathew et al, 2020). 

Besides understanding predictions and data representations, another interpreta- 
tion scheme is to understand the role of model components. A well-known example 
is to visualize the visual patterns that maximally activate the target neuron/layer in 
a CNN model (Olah et al, 2018). In this way, we understand what kind of visual 
signal is detected by the target component. The interpretation is usually obtained 
through a generative process, so that the result is understandable to humans. 


7.1.3.2 Interpretable Modeling 


Interpretable modeling is achieved by incorporating interpretability directly into
the model structures or the learning process. It is still an extremely challenging
problem to develop models that are both transparent and able to achieve
state-of-the-art performance. Many efforts have been devoted to improving the intrinsic
interpretability of deep models. Some details are discussed below.

A straightforward strategy is to rely on distillation. Specifically, we first build
a complex model (e.g., a deep model) to achieve good performance. Then, we use
another model, which is readily recognized as interpretable, to mimic the predictions
of the complex model. The pool of interpretable models includes linear models,
decision trees, rule-based models, etc. This strategy is also called mimic learning.
The interpretable model trained in this way tends to perform better than one obtained
by normal training, and it is also much easier to understand than the complex model.
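
The two steps of mimic learning can be sketched with a one-dimensional toy problem. The
example is a hedged illustration, not a recipe from the literature: the `teacher`
function merely stands in for a trained complex model, the student is a straight line
fitted by ordinary least squares, and all names are hypothetical.

```python
import math


def teacher(x):
    # Stand-in for a complex, already trained model (assumption).
    return math.tanh(2.0 * x) + 0.1 * x


def fit_linear_student(xs, targets):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_t = sum(targets) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    cov = sum((x - mean_x) * (t - mean_t) for x, t in zip(xs, targets))
    slope = cov / var_x
    intercept = mean_t - slope * mean_x
    return slope, intercept


# Step 1: query the complex model to obtain the targets the student must mimic.
xs = [i / 10.0 for i in range(-20, 21)]
soft_targets = [teacher(x) for x in xs]

# Step 2: train the interpretable student on the teacher's outputs.
slope, intercept = fit_linear_student(xs, soft_targets)


def student(x):
    return slope * x + intercept


mimic_mse = sum((student(x) - t) ** 2 for x, t in zip(xs, soft_targets)) / len(xs)
```

The fitted `slope` is directly readable as the student's explanation of the teacher's
overall trend, while `mimic_mse` indicates how faithfully the simple model reproduces
the complex one.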

Attention models, originally introduced for machine translation tasks, have now 
become enormously popular, partially due to their interpretation properties. The in- 
tuition behind attention models can be explained using human biological systems, 
where we tend to selectively focus on some parts of the input, while ignoring other 
irrelevant parts (Xu et al, 2015). By examining attention scores, we could know 
which features in the input have been used for making the prediction. This is also 
similar to using post-hoc interpretation algorithms that find which input features are 
important. The major difference is that attention scores are generated during model 
prediction, while post-hoc interpretation is performed after prediction. 

Deep models heavily rely on learning effective representations to compress in- 
formation for downstream tasks. However, it is hard for humans to understand the 
representations as the meanings of different dimensions are unknown. To tackle this 
challenge, disentangled representation learning has been proposed. Disentangled 
representation learning breaks down features of different meanings and encodes 
them as separate dimensions in representations. As a result, we could check each 
dimension to understand which factors of input data are encoded. For example, af- 
ter learning disentangled representations on 3D-chair images, factors such as chair 
leg style, width and azimuth, are separately encoded into different dimensions (Hig- 
gins et al, 2017). 


7.1.4 Opportunities and Challenges 


Despite the major progress made in domains such as vision, language and control, 
many defining characteristics of human intelligence remain out of reach for tradi- 
tional deep models such as convolutional neural networks (CNNs), recurrent neural 
networks (RNNs) and multi-layer perceptrons (MLPs). In the search for new model
architectures, GNN architectures are believed to lay the foundation for more
interpretable patterns of reasoning (Battaglia et al, 2018). In this part, we discuss the
advantages of GNNs and the challenges to be tackled in terms of interpretability.

The GNN architecture is regarded as more interpretable because it facilitates 
learning about entities, relations, and rules for composing them. First, entities are 
discrete and usually represent high-level concepts or knowledge items, so they are
regarded as easier for humans to understand than image pixels (tiny granularity) or
word embeddings (latent space vectors). Second, GNN inference propagates infor- 
mation through links, so it is easier to find the explicit reasoning path or subgraph 
that contributes to the prediction result. Therefore, there is a recent trend of trans- 
forming images or text data into graphs, and then applying GNN models for predic- 
tions. For example, to build a graph from an image, we can treat objects inside the 
image (or different portions within an object) as nodes, and generate links based on
the spatial relations between nodes. Similarly, a document can be transformed into a
graph by discovering concepts (e.g., nouns, named entities) as nodes and extracting 
their relations as links through lexical parsing. 

Although the graph data format lays a foundation for interpretable modeling, 
there are still several challenges that undermine GNN interpretability. First, GNNs
still map nodes and links into embeddings. Therefore, similar to traditional deep
models, GNNs also suffer from the opacity of information processing in intermediate
layers. Second, different information propagation paths or subgraphs contribute
differently to the final prediction. GNNs do not directly provide the most important
reasoning paths for their predictions, so post-hoc interpretation methods are still
needed. In the following sections, we will introduce the recent advances in tackling
the above challenges to improve the explainability and interpretability of GNNs. 


7.2 Explanation Methods for Graph Neural Networks 


In this section, we introduce the post-hoc explanation methods for understand- 
ing GNN predictions. Similar to the categorization in Section 7.1.3, we include 
approximation-based methods, relevance-propagation-based methods, perturbation- 
based methods, and generative methods. 


7.2.1 Background 


Before introducing the techniques, we first provide the definition of graphs and re- 
view the fundamental formulations of a GNN model. 

Graphs: In the rest of the chapter, if not specified, the graphs we discuss are 
limited to homogeneous graphs. 


Definition 7.3. A homogeneous graph is defined as $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, where $\mathcal{V}$ is the set
of nodes and $\mathcal{E}$ is the set of edges between nodes.


Furthermore, let $A \in \mathbb{R}^{n \times n}$ be the adjacency matrix of $\mathcal{G}$, where $n = |\mathcal{V}|$. For
unweighted graphs, $A_{i,j}$ is binary, where $A_{i,j} = 1$ means there exists an edge $(i, j) \in \mathcal{E}$,
otherwise $A_{i,j} = 0$. For weighted graphs, each edge $(i, j)$ is assigned a weight $w_{i,j}$,
so $A_{i,j} = w_{i,j}$. In some cases, nodes are associated with features, which could be
denoted as $X \in \mathbb{R}^{n \times m}$, and $X_{i,:}$ is the feature vector of node $i$. The number of
features for each node is $m$. In this chapter, unless otherwise stated, we focus on GNN
models on homogeneous graphs.

GNN Fundamentals: Traditional GNNs propagate information via the input 
graph’s structure according to the propagation scheme: 


$$H^{l+1} = \sigma\left(\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} H^{l} W^{l}\right), \tag{7.2}$$

Fig. 7.3: Illustration of explanation result formats. Explanation results for graph
neural networks could be the important nodes, the important edges, the important 
features, etc. An explanation method may return multiple types of results. 


where $H^{l}$ denotes the embedding matrix at layer $l$, and $W^{l}$ denotes the trainable
parameters at layer $l$. Also, $\tilde{A} = A + I$ denotes the adjacency matrix of the graph
after adding self-loops. The matrix $\tilde{D}$ is the diagonal degree matrix of $\tilde{A}$, i.e.,
$\tilde{D}_{i,i} = \sum_{j} \tilde{A}_{i,j}$. Therefore, $\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}}$ normalizes the adjacency matrix. If we only
focus on the embedding update of node $i$, the GCN propagation scheme could be
rewritten as:

$$H_{i}^{l+1} = \sigma\Big(\sum_{j \in \mathcal{V}_i \cup \{i\}} \frac{1}{c_{i,j}} H_{j}^{l} W^{l}\Big), \tag{7.3}$$

where $H_{j}$ denotes the $j$-th row of matrix $H$, and $\mathcal{V}_i$ denotes the neighbors of node
$i$. Here $c_{i,j}$ is a normalization constant, and $\frac{1}{c_{i,j}} = (\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}})_{i,j}$. Therefore, the
embedding of node $i$ at layer $l$ can be seen as aggregating the embeddings
of the neighbors of node $i$, followed by some transformations. The embeddings
in the first layer, $H^{0}$, are usually set as the node features. As the layers go
deeper, the computation of each node’s embedding will include more distant nodes. For
example, in a 2-layer GNN, computing the embedding of node $i$ will use the
information of nodes within the 2-hop neighborhood of node $i$. The subgraph composed
of these nodes is called the computation graph of node $i$, as shown in Fig. 7.3.
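
The propagation rule in Equation 7.2 and the notion of a computation graph can be illustrated with a minimal NumPy sketch. The toy path graph, the one-hot features, the identity weight matrices and the helper names `normalize_adjacency` and `gcn_layer` are assumptions made for illustration, and ReLU is chosen here as the nonlinearity.

```python
import numpy as np

def normalize_adjacency(A):
    """Return D~^{-1/2} (A + I) D~^{-1/2} for a dense adjacency matrix A."""
    A_tilde = A + np.eye(A.shape[0])        # add self-loops
    deg = A_tilde.sum(axis=1)               # diagonal of the degree matrix
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    return A_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def gcn_layer(A_hat, H, W):
    """One propagation step: ReLU(A_hat @ H @ W)."""
    return np.maximum(A_hat @ H @ W, 0.0)

# Toy path graph 0 - 1 - 2 - 3 (illustrative assumption).
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
A_hat = normalize_adjacency(A)

H0 = np.eye(4)   # one-hot features make the information flow visible
W = np.eye(4)    # identity weights, so only the graph structure acts
H1 = gcn_layer(A_hat, H0, W)
H2 = gcn_layer(A_hat, H1, W)

# Entry j of row i is nonzero only if node j lies in the computation graph of node i.
print(np.nonzero(H1[0])[0])  # nodes reached by node 0 after one layer
print(np.nonzero(H2[0])[0])  # nodes reached by node 0 after two layers
```

In this toy setting, node 0 draws on nodes {0, 1} after one layer and on nodes {0, 1, 2} after two layers, which is its 2-hop computation graph; node 3 never contributes.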

Target Models: There are two common tasks in graph analysis, i.e., graph-level
predictions and node-level predictions. We use classification tasks as the example. In
graph-level tasks, the model $f(\mathcal{G}) \in \mathbb{R}^{C}$ produces a single prediction for the whole
graph, where $C$ is the number of classes. The prediction score for class $c$ could
be written as $f^{c}(\mathcal{G})$. In node-level tasks, the model $f(\mathcal{G}) \in \mathbb{R}^{n \times C}$ returns a matrix,
where each row is the prediction for a node. Some explanation methods are designed
solely for graph-level tasks, some are for node-level tasks, while some could handle
both scenarios. The computation graphs introduced above are commonly used in
explaining node-level predictions.

Fig. 7.4: Illustration of several gradient-based explanation methods. Methods relying
on local gradients may suffer from the saturation problem or noise in the input,
where a feature’s local sensitivity is not consistent with its overall contribution.
SmoothGrad removes noise in an explanation by averaging multiple explanations
on nearby points. IG is more accurate than Grad ⊙ Input in measuring feature
contribution.


Interpretation Formats: According to the introduction above, there are several 
input modes that could be included in the explanation as shown in Fig. 7.3. Specif- 
ically, explanation methods could identify what are the important nodes, important 
edges and important features that contribute most to the prediction. Some explana- 
tion methods may identify multiple types of input modes simultaneously. 


7.2.2 Approximation-Based Explanation 


The approximation-based explanation has been widely used to analyze the predic- 
tion of models with complex structures. Approximation-based approaches could be 
further divided into white-box approximation and black-box approximation. The 
white-box approximation uses information inside the model, which includes but is 
not limited to gradients, intermediate features, model parameters, etc. The black-box 
approximation does not utilize information propagation inside the model. It usually 
uses a simple and interpretable model to fit the target model’s decision on an input 
instance. Then, the explanation can be easily extracted from the simple model. The 
details of commonly used methods for both categories are introduced as below. 


7.2.2.1 White-Box Approximation Method 


Sensitivity Analysis (SA) (Baldassarre and Azizpour, 2019) studies the impact of a
particular change in an independent variable on a dependent variable. In the context
of explanation, the dependent variable refers to the prediction, while the independent
variables refer to the features. The local gradient of the model is commonly used as
the sensitivity score to represent the correlation between a feature and the prediction
result. The sensitivity score is defined as:


$$S(x) = \|\nabla_{x} f(\mathcal{G})\|^{2}, \tag{7.4}$$


where $\mathcal{G}$ is the input instance graph to be explained, and $f(\mathcal{G})$ is the model prediction
function. Here $x$ refers to the feature vector of a node of interest. Node features
with higher sensitivity scores are more important because they can lead to drastic 
changes to model decisions. 

Although SA is intuitive and straightforward, its effectiveness is still limited. 
It assumes input features are mutually independent, and does not necessarily pay 
attention to their correlations in the actual decision-making process. Also, sensitivity 
analysis only measures the impact of local changes to the decision function $f(\mathcal{G})$,
rather than thoroughly explaining the decision function value itself. Explanation 
results provided by sensitivity analysis are usually relatively noisy and challenging 
to comprehend. Therefore, some follow-up techniques have been developed trying 
to overcome this limitation (as shown in Fig. 7.4). 

GuidedBP (Baldassarre and Azizpour, 2019) is similar to SA except that it only
detects the features that positively activate the neurons, with the assumption that
negative gradients may confuse the contribution of important features and make the
visualization noisy. To follow this intuition, GuidedBP modifies the backpropagation
process of SA and discards all negative gradients.

Grad ⊙ Input (Sanchez-Lengeling et al, 2020) measures the feature contribution
scores as the element-wise product of the input features and the gradients of the decision
function with respect to the features:


$$S(x) = \nabla_{x} f(\mathcal{G}) \odot x. \tag{7.5}$$


Therefore, Grad ⊙ Input considers not only the feature sensitivity, but also the scale
of feature values. However, the methods mentioned above all suffer from the
saturation problem, where the scope of the local gradients is too limited to reflect the
overall contribution of each feature.
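
A small numerical sketch may help compare the two scores. It takes the sensitivity score to be the squared norm of the local gradient (one common choice, assumed here) and computes Grad ⊙ Input from the same gradient. The toy model `model_score`, its weights, the 3-node graph and the finite-difference helper `numerical_gradient` are all illustrative assumptions; a real implementation would use automatic differentiation on the trained GNN.

```python
import numpy as np

def model_score(X, A_hat, W, w_out):
    """Toy graph-level score: mean-pooled tanh(A_hat X W) with a linear readout."""
    H = np.tanh(A_hat @ X @ W)
    return float(H.mean(axis=0) @ w_out)

def numerical_gradient(f, X, eps=1e-5):
    """Central finite-difference estimate of df/dX, one entry at a time."""
    grad = np.zeros_like(X)
    for idx in np.ndindex(*X.shape):
        step = np.zeros_like(X)
        step[idx] = eps
        grad[idx] = (f(X + step) - f(X - step)) / (2 * eps)
    return grad

# Hypothetical 3-node graph; A_tilde already contains self-loops.
A_tilde = np.array([[1, 1, 0], [1, 1, 1], [0, 1, 1]], dtype=float)
d = A_tilde.sum(axis=1)
A_hat = A_tilde / np.sqrt(np.outer(d, d))     # symmetric normalization

X = np.array([[1.0, 0.5], [0.0, -1.0], [2.0, 0.0]])   # node features
W = np.array([[0.5, -0.3], [0.2, 0.8]])
w_out = np.array([1.0, -1.0])

f = lambda Z: model_score(Z, A_hat, W, w_out)
grad = numerical_gradient(f, X)

sa_scores = (grad ** 2).sum(axis=1)   # squared gradient norm per node
grad_x_input = grad * X               # Grad ⊙ Input per feature
print(sa_scores)
print(grad_x_input)
```

Note that Grad ⊙ Input assigns exactly zero to any feature whose value is zero, whatever its gradient, whereas the sensitivity score ignores the feature scale altogether.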

Integrated Gradients (IG) (Sanchez-Lengeling et al, 2020) solves the saturation
problem by aggregating feature contributions along a designed path in the input space.
This path starts from a chosen baseline point $\mathcal{G}'$ and ends at the target input $\mathcal{G}$.
Specifically, the feature contribution is computed as:


$$S(x) = (x - x') \odot \int_{0}^{1} \nabla_{x} f\big(\mathcal{G}' + \alpha(\mathcal{G} - \mathcal{G}')\big)\, d\alpha, \tag{7.6}$$


where $x'$ denotes a feature vector in the baseline point $\mathcal{G}'$, while $x$ is a feature vector
in the original input $\mathcal{G}$. The choice of baseline $\mathcal{G}'$ is relatively flexible. A typical
strategy is to use a null graph as the baseline, which has the same topology but whose
nodes use “unspecified” categorical features. This is motivated by the application of
IG in explaining image classification models (Sundararajan et al, 2017), where the
baseline is usually chosen as a pure black image or an image with random noise.
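
The path integral in IG can be approximated by a Riemann sum. The sketch below is a minimal illustration under stated assumptions: the toy model `model_score`, its weights, the finite-difference gradient and the all-zero feature baseline (same topology, zeroed features) are hypothetical choices, not the setup of the cited work.

```python
import numpy as np

def model_score(X, A_hat, W, w_out):
    """Toy graph-level score: mean-pooled tanh(A_hat X W) with a linear readout."""
    H = np.tanh(A_hat @ X @ W)
    return float(H.mean(axis=0) @ w_out)

def numerical_gradient(f, X, eps=1e-5):
    """Central finite-difference estimate of df/dX."""
    grad = np.zeros_like(X)
    for idx in np.ndindex(*X.shape):
        step = np.zeros_like(X)
        step[idx] = eps
        grad[idx] = (f(X + step) - f(X - step)) / (2 * eps)
    return grad

def integrated_gradients(f, X, X_base, steps=200):
    """Midpoint Riemann-sum approximation of the path integral from X_base to X."""
    total = np.zeros_like(X)
    for s in range(steps):
        alpha = (s + 0.5) / steps
        total += numerical_gradient(f, X_base + alpha * (X - X_base))
    return (X - X_base) * total / steps

A_tilde = np.array([[1, 1, 0], [1, 1, 1], [0, 1, 1]], dtype=float)
d = A_tilde.sum(axis=1)
A_hat = A_tilde / np.sqrt(np.outer(d, d))

X = np.array([[1.0, 0.5], [0.0, -1.0], [2.0, 0.3]])
X_base = np.zeros_like(X)              # assumed baseline: same graph, zero features
W = np.array([[0.5, -0.3], [0.2, 0.8]])
w_out = np.array([1.0, -1.0])

f = lambda Z: model_score(Z, A_hat, W, w_out)
ig = integrated_gradients(f, X, X_base)

# Completeness: attributions sum (approximately) to f(X) - f(X_base).
print(ig.sum(), f(X) - f(X_base))
```

The final line checks the completeness property of IG, which is what distinguishes it from purely local gradient scores: the attributions account for the whole change in the prediction between the baseline and the input.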

The explanations obtained by the above methods usually contain considerable noise.
Therefore, Smilkov et al (2017) propose SmoothGrad to alleviate the problem.
SmoothGrad averages attributions evaluated on a number of noise-perturbed versions
of the input. This method was initially aimed at sharpening the saliency maps on
images. Furthermore, Sanchez-Lengeling et al (2020) apply it to the Grad ⊙ Input
method by adding Gaussian noise to node and edge features, and averaging multiple
explanations into a smoothed one.

Class Activation Mapping (CAM) (Pope et al, 2019) is an explanation method
that was initially developed for CNNs. This method only works under a specific model
architecture, where the last convolutional layer is followed by a global average pooling
(GAP) layer before the final softmax layer. The feature maps (i.e., activations)
in the last convolutional layer are aggregated and re-scaled to the same size as the
input image, so that the activations highlight the important regions in the image. The
idea of CAM can also be adapted to graph neural networks. Specifically, the GAP
layer in a GNN could be defined as averaging the embeddings of all nodes in the last
graph convolution layer: $h = \frac{1}{n} \sum_{i=1}^{n} H_{i}^{L}$, where $L$ is the last graph convolution layer.
CAM treats each dimension of the final node embeddings (i.e., $H^{L}$) as a feature
map. The logit value for class $c$ is:


$$f^{c}(\mathcal{G}) = \sum_{k} w_{k}^{c} h_{k}, \tag{7.7}$$


where $h_{k}$ denotes the $k$-th entry of $h$, and $w_{k}^{c}$ is the GAP-layer weight of the $k$-th feature map
with respect to class $c$. Therefore, the contribution of node $i$ to the prediction is:


$$S(i) = \frac{1}{n} \sum_{k} w_{k}^{c} H_{i,k}^{L}. \tag{7.8}$$


Although CAM is simple and efficient, it only works on models with certain
structures, which greatly limits its application scenarios.
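
The decomposition behind CAM can be checked numerically. In the sketch below, `H_L` and `W_gap` are random stand-ins (assumptions) for the final-layer node embeddings and the weights applied after the GAP layer; no trained model is involved.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, C = 5, 4, 3                     # nodes, embedding size, classes (toy sizes)
H_L = rng.normal(size=(n, d))         # stand-in for final-layer node embeddings
W_gap = rng.normal(size=(d, C))       # stand-in for the weights after the GAP layer

h = H_L.mean(axis=0)                  # global average pooling over nodes
logits = h @ W_gap                    # class logits, one per class

# Contribution of node i to class c: (1/n) * sum_k w_k^c * H^L_{i,k}
node_contrib = (H_L @ W_gap) / n      # shape (n, C)

c = int(np.argmax(logits))
print("predicted class:", c)
print("node contributions:", node_contrib[:, c])
print("sum of contributions:", node_contrib[:, c].sum(), "logit:", logits[c])
```

Because the pooling and the final layer are both linear, the node contributions for a class add up exactly to that class's logit, which is why CAM requires this particular architecture.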

Grad-CAM (Pope et al, 2019) combines gradient information with feature maps
to relax the limitation of CAM. While CAM uses the GAP layer to estimate the
weight of each feature map, Grad-CAM employs the gradient of the output with respect
to the feature maps to compute the weights, so that:


$$w_{k}^{c} = \frac{1}{n} \sum_{i=1}^{n} \frac{\partial f^{c}(\mathcal{G})}{\partial H_{i,k}^{L}}, \tag{7.9}$$

$$S(i) = \mathrm{ReLU}\Big(\sum_{k} w_{k}^{c} H_{i,k}^{L}\Big). \tag{7.10}$$


The ReLU function forces the explanation to focus on the positive influence on the
class of interest. Grad-CAM is equivalent to CAM for GNNs with only one
fully-connected layer before output. Compared to CAM, Grad-CAM can be applied to
more GNN architectures, thus avoiding the trade-off between model explainability
and capacity.
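
The stated equivalence between Grad-CAM and CAM can be verified on a toy head consisting of global average pooling followed by a single linear layer. Everything named below (`class_score`, `numerical_gradient`, the random embeddings and weights) is an illustrative assumption; gradients are estimated by finite differences only to keep the sketch dependency-free.

```python
import numpy as np

def class_score(H, W_head, c):
    """Toy head: global average pooling followed by one linear layer."""
    return float(H.mean(axis=0) @ W_head[:, c])

def numerical_gradient(f, H, eps=1e-5):
    """Central finite-difference estimate of df/dH."""
    grad = np.zeros_like(H)
    for idx in np.ndindex(*H.shape):
        step = np.zeros_like(H)
        step[idx] = eps
        grad[idx] = (f(H + step) - f(H - step)) / (2 * eps)
    return grad

rng = np.random.default_rng(1)
n, d, C = 5, 4, 3
H_L = rng.normal(size=(n, d))          # stand-in final-layer node embeddings
W_head = rng.normal(size=(d, C))       # stand-in linear layer after pooling
c = 0                                  # class of interest

grad = numerical_gradient(lambda H: class_score(H, W_head, c), H_L)
weights = grad.mean(axis=0)                    # gradient averaged over nodes
grad_cam = np.maximum(H_L @ weights, 0.0)      # ReLU of the weighted feature maps

cam = (H_L @ W_head[:, c]) / n                 # CAM node contributions
print(np.allclose(grad_cam, np.maximum(cam, 0.0), atol=1e-6))
```

With a deeper or nonlinear head the two scores would generally differ, and only the gradient-based weights would remain well defined.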


7.2.2.2 Black-Box Approximation Methods 


Different from white-box approximation methods, black-box approximation meth- 
ods manage to bypass the need to obtain internal information of complex models. 
The general idea is to use models that are intrinsically interpretable (such as linear 
regressions, decision trees) to fit the complex model. Then, we can explain the de- 
cision based on the simple models. The fundamental assumption behind this is that: 
Given an input instance, the model’s decision boundary within the neighborhood 
of that instance can be well approximated by the interpretable model. The major 
challenge is how to define the neighborhood space given an input graph which is a 
discrete data structure. 

We introduce several approaches, including GraphLime (Huang et al, 2020c), 
RelEx (Zhang et al, 2020a), and PGM-Explainer (Vu and Thai, 2020). These meth- 
ods share a similar procedure: First, a neighborhood space is defined around the 
target instance. Second, data points are sampled within this space and their predic- 
tions are obtained after being fed into the target model. A training dataset is built, 
where each instance-label pair consists of a sampled point and its prediction. Finally, 
an interpretable model is trained by using the dataset. The key difference between 
these methods lies in two aspects, i.e., the definition of the neighborhood, and the 
choice of the interpretable model. 

GraphLime is a local explanation method for GNN predictions on graph nodes.
Given the prediction result on a target node $v_i$, GraphLime defines the neighborhood
space as the set of nodes which are in the $k$-hop neighborhood of the target node in
the input graph:


$$\mathcal{V}_i = \{v \mid \mathrm{distance}(v_i, v) \le k,\ v \in \mathcal{V}\}, \tag{7.11}$$


where the $k$-hop neighborhood refers to the nodes which are within $k$ hops from
the target node. GraphLime collects the features of nodes in $\mathcal{V}_i$ as the corpus, and
employs HSIC Lasso (Hilbert-Schmidt independence criterion Lasso) to measure
the independence between features and predictions of the nodes. The top important
features are selected as the explanation result, so GraphLime cannot provide
explanations based on structural information of the graph.
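
The neighborhood in Equation 7.11 can be collected with a breadth-first search. The sketch below covers only this first step, not the HSIC Lasso fit; the function name `k_hop_neighborhood` and the toy adjacency list are assumptions for illustration.

```python
from collections import deque

def k_hop_neighborhood(adj_list, target, k):
    """Return all nodes whose shortest-path distance to `target` is at most k."""
    distance = {target: 0}
    queue = deque([target])
    while queue:
        u = queue.popleft()
        if distance[u] == k:          # do not expand beyond k hops
            continue
        for v in adj_list[u]:
            if v not in distance:
                distance[v] = distance[u] + 1
                queue.append(v)
    return set(distance)

# Toy undirected graph as an adjacency list (illustrative assumption).
adj_list = {0: [1], 1: [0, 2, 4], 2: [1, 3], 3: [2], 4: [1]}
print(k_hop_neighborhood(adj_list, target=0, k=2))
```

For the toy graph, the 2-hop neighborhood of node 0 contains nodes 0, 1, 2 and 4; node 3 is three hops away and is excluded.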

RelEx defines the neighborhood space as a set of perturbed graphs derived from the
computation graph of the target node. Similar to GraphLime, RelEx explains GNN
predictions on nodes. The computation graph $\mathcal{G}_c$ of the target node $v_i$ is composed
of the $k$-hop neighbor nodes around node $v_i$ and the edges that connect them.
First, RelEx proposes a BFS sampling strategy to sample multiple perturbed graphs
$\{\mathcal{G}_1, \mathcal{G}_2, \ldots, \mathcal{G}_T\}$ from the computation graph. These perturbed graphs are fed into
the original GNN $f$ to build a training set $\{\mathcal{G}_t, f(\mathcal{G}_t)\}_{t=1}^{T}$. Then, a new GNN $f'$ is
trained upon the training set to approximate $f$. After that, a mask $M$ is trained for
explanation. The mask is applied to the adjacency matrix of $\mathcal{G}_c$. The value of each
mask entry is in $[0,1]$, so it is a soft mask. There are two loss terms for training the
mask: (1) $f'(\mathcal{G}_c \odot M)$ is close to $f'(\mathcal{G}_c)$, (2) the mask $M$ is sparse. The resultant mask
entry values indicate the importance scores of edges in $\mathcal{G}_c$, where a higher mask value
means the corresponding edge is more important.

PGM-Explainer applies probabilistic graphical models to explain GNNs. To 
find the neighbor instances of the target, PGM-Explainer first randomly selects 
nodes to be perturbed from computation graphs. Then, the selected nodes’ features 
are set to the mean value among all nodes. After that, PGM-Explainer employs a 
pair-wise dependence test to filter out unimportant samples, aiming at reducing the 
computational complexity. Finally, a Bayesian network is introduced to fit the pre- 
dictions of chosen samples. Therefore, the advantage of PGM-Explainer is that it 
illustrates the dependency between features. 


7.2.3 Relevance-Propagation Based Explanation 


Relevance propagation redistributes the activation score of the output neuron to its
predecessor neurons, iterating until the input neurons are reached. The core of relevance
propagation methods is defining a rule for redistributing activation between
neurons. Relevance propagation has been widely used to explain models in
domains such as computer vision and natural language processing. Recently, some
work has explored adapting relevance propagation methods to GNNs.
Representative approaches include LRP (Layer-wise Relevance
Propagation) (Baldassarre and Azizpour, 2019; Schwarzenberg et al, 2019),
GNN-LRP (Schnake et al, 2020), and ExcitationBP (Pope et al, 2019).

LRP was first proposed by Bach et al (2015) to calculate the contribution of individual
pixels to the prediction result for an image classifier. The core idea of LRP is
to use back propagation to recursively propagate the relevance scores of high-level 
neurons to low-level neurons, up to the input-level feature neurons. The relevance 
score of the output neuron is set as the prediction score. The relevance score that 
a neuron receives is proportional to its activation value, which follows the intu- 
ition that neurons with higher activation tend to contribute more to the prediction. 
In (Baldassarre and Azizpour, 2019; Schwarzenberg et al, 2019), the propagation
rule is defined as below:


$$R_{i}^{l} = \sum_{j} \frac{x_{i}^{l} w_{i,j}^{+}}{\sum_{k} x_{k}^{l} w_{k,j}^{+} + \epsilon} R_{j}^{l+1}, \tag{7.12}$$


where $R_{i}^{l}$ and $R_{j}^{l+1}$ are the relevance scores of neuron $i$ in layer $l$ and neuron $j$ in layer
$l+1$, respectively. $x_{i}^{l}$ is the activation of neuron $i$ in layer $l$. $w_{i,j}$ is the connection
weight, and $w_{i,j}^{+}$ is its positive part. $\epsilon$ prevents the denominator from being zero. This propagation rule only
allows positive activation values. Also, explanations obtained using this method are
limited to nodes and node features, where graph edges are excluded. The reason
is that the adjacency matrix is treated as part of the GNN model. Therefore, LRP 
is unable to analyze topological information which nevertheless plays an important 
role in graph data. 
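
To make the redistribution concrete, the sketch below implements one backward step through a single dense layer. It assumes the variant of the rule in which only positive weights take part and a small $\epsilon$ stabilizes the denominator; the function name `lrp_zplus` and all numbers are illustrative, and graph convolution is not modeled.

```python
import numpy as np

def lrp_zplus(x, W, R_out, eps=1e-9):
    """Redistribute relevance R_out (layer l+1) to the inputs x (layer l).

    Only positive weights take part, so with non-negative activations each
    share x_i * w_ij^+ is non-negative.
    """
    W_pos = np.maximum(W, 0.0)
    z = x[:, None] * W_pos                 # z[i, j] = x_i * w_ij^+
    denom = z.sum(axis=0) + eps            # one denominator per output neuron j
    return (z / denom) @ R_out             # R_i = sum_j z_ij / denom_j * R_j

x = np.array([1.0, 2.0, 0.5])              # non-negative activations (e.g. after ReLU)
W = np.array([[0.4, -0.2],
              [0.1,  0.3],
              [-0.5, 0.6]])
R_out = np.array([0.7, 0.3])               # relevance of the two output neurons

R_in = lrp_zplus(x, W, R_out)
print(R_in, R_in.sum())
```

Up to the effect of $\epsilon$, the total relevance is conserved: the input relevances sum to the same value as the output relevances, so the prediction score is redistributed rather than created or lost.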

ExcitationBP is a top-down attention model originally developed for CNNs 
(Zhang et al, 2018d). It shares a similar idea as LRP. However, ExcitationBP defines 
the relevance score as a probability distribution and uses a conditional probability 
model to describe the relevance propagation rule:


$$P(a_j) = \sum_{i} P(a_j \mid a_i) P(a_i), \tag{7.13}$$


where $a_j$ is the $j$-th neuron in the lower layer and $a_i$ is the $i$-th parent neuron of
$a_j$ in the higher layer. When the propagation process passes through the activation
function, only non-negative weights are considered and negative weights are set to 
zero. To extend ExcitationBP for graph data, new backward propagation schemes 
are designed for the softmax classifier, the GAP (global average pooling) layer and 
the graph convolutional operator. 

GNN-LRP mitigates the weakness of traditional LRP by defining a new prop- 
agation rule. Instead of using the adjacency matrix to obtain propagation paths, 
GNN-LRP assigns the relevance score to a walk, which refers to a message flow 
path in the graph. The relevance score is defined by the T-order Taylor expansion of 
the model with respect to the incorporation operator (graph convolutional operator, 
linear message function, etc.). The intuition is that the incorporation operator with 
greater gradients has a greater influence on the final decision. 


7.2.4 Perturbation-Based Approaches 


An assumption behind prediction explanations is that important input parts signif- 
icantly contribute to the output while unimportant parts have minor influences. It 
thus implies that masking out the unimportant parts will have a negligible impact on 
the output, and masking out the important parts will have a significant impact. The 
goal is to find a mask M to indicate graph component importance. The mask could 
be applied to nodes, edges or features in graphs. The mask value can either be binary
$M_i \in \{0,1\}$ or continuous $M_i \in [0,1]$. Some recent perturbation-based approaches
are introduced as below. 

GNNExplainer (Ying et al, 2019) is the first perturbation-based explanation
method for GNNs. Given the model’s prediction on a node $v$, GNNExplainer tries
to find a compact subgraph $\mathcal{G}_S$ from the computation graph of node $v$ that is most
crucial for the prediction. The problem is defined as maximizing the mutual
information (MI) between the predictions of the original computation graph and the
predictions of the subgraph:


$$\max_{\mathcal{G}_S} MI\big(Y, (\mathcal{G}_S, X_S)\big) = H(Y) - H(Y \mid G = \mathcal{G}_S, X = X_S), \tag{7.14}$$


where $\mathcal{G}_S$ and $X_S$ are the subgraph and its nodes’ features. $Y$ is the predicted label
distribution, and its entropy $H(Y)$ is a constant. To solve the optimization problem
above, the authors apply a soft mask $M$ on the adjacency matrix:


$$\min_{M} -\sum_{c=1}^{C} \mathbb{1}[y = c] \log P_{\Phi}(Y = y \mid G = A_c \odot \sigma(M), X = X_c), \tag{7.15}$$


where $A_c$ is the adjacency matrix of the computation graph, $X_c$ is the corresponding
feature matrix, and $M$ denotes the trainable parameters. The sigmoid function $\sigma$
projects the mask values into $[0,1]$. Finally, a subgraph is built by selecting the edges
(and the nodes connected by these edges) corresponding to the high values in $M$.
Besides providing explanations based on graph structures, GNNExplainer could also
offer feature-wise explanations by applying a similar mask learning process on
features. Moreover, regularization techniques could be applied to enforce the
explanation to be sparse. As a model-agnostic approach, GNNExplainer is suitable for
any graph-based machine learning tasks and GNN models.
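
The masked objective can be evaluated directly for a given mask. The sketch below only computes the loss for two fixed masks and does not run the optimization; in practice $M$ would be updated by gradient descent with automatic differentiation. The function `predict_proba` is a hypothetical stand-in for the trained GNN, and the toy graph, features and weights are assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(A, X, W):
    """Toy stand-in for a trained GNN: one mean-pooled linear graph layer + softmax."""
    A_self = A + np.eye(A.shape[0])               # self-loops are left unmasked
    logits = (A_self @ X @ W).mean(axis=0)
    e = np.exp(logits - logits.max())
    return e / e.sum()

def masked_nll(M, A_c, X_c, W, y):
    """Negative log-likelihood of label y under the soft-masked adjacency."""
    A_masked = A_c * sigmoid(M)                   # element-wise soft edge mask
    return -np.log(predict_proba(A_masked, X_c, W)[y])

A_c = np.array([[0, 1, 1], [1, 0, 0], [1, 0, 0]], dtype=float)
X_c = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
W = np.array([[1.0, -1.0], [-0.5, 0.5]])

p_full = predict_proba(A_c, X_c, W)
y = int(np.argmax(p_full))                        # label predicted on the full graph

M_open = np.full(A_c.shape, 20.0)                 # sigmoid close to 1: keep all edges
M_closed = np.full(A_c.shape, -20.0)              # sigmoid close to 0: drop all edges
loss_open = masked_nll(M_open, A_c, X_c, W, y)
loss_closed = masked_nll(M_closed, A_c, X_c, W, y)
print(loss_open, loss_closed)
```

Keeping every edge reproduces the original prediction and hence the lowest loss, while removing every edge increases it; an explanation is a sparse mask whose loss stays close to the first value.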

PGExplainer (Luo et al, 2020) shares the same idea with GNNExplainer and
learns a discrete mask applied on edges to explain the predictions. The main idea is
to use a deep neural network to generate edge mask values:


$$M_{i,j} = \mathrm{MLP}_{\Psi}([z_i; z_j]), \tag{7.16}$$


where $\Psi$ denotes the trainable parameters of the MLP. $z_i$ and $z_j$ are the embedding
vectors for nodes $i$ and $j$, respectively. $[\cdot;\cdot]$ denotes concatenation. Similar to
GNNExplainer, the mask generator is trained by maximizing the mutual information
between the original prediction and the new prediction.
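
A mask generator of this form is straightforward to sketch. The example below uses a one-hidden-layer MLP with random, untrained parameters and a sigmoid output to obtain values in $[0,1]$; the architecture, the sigmoid squashing and the name `edge_mask_mlp` are assumptions for illustration and omit the training procedure.

```python
import numpy as np

def edge_mask_mlp(Z, edges, params):
    """Map each edge (i, j) to a mask value from the concatenation [z_i; z_j]."""
    W1, b1, w2, b2 = params
    pairs = np.array([np.concatenate([Z[i], Z[j]]) for i, j in edges])
    hidden = np.maximum(pairs @ W1 + b1, 0.0)     # one ReLU hidden layer
    logits = hidden @ w2 + b2
    return 1.0 / (1.0 + np.exp(-logits))          # squash to the unit interval

rng = np.random.default_rng(0)
d, hidden_dim = 4, 8
Z = rng.normal(size=(5, d))                       # stand-in node embeddings
edges = [(0, 1), (1, 2), (2, 3), (3, 4)]
params = (rng.normal(size=(2 * d, hidden_dim)), np.zeros(hidden_dim),
          rng.normal(size=hidden_dim), 0.0)       # untrained, random parameters

mask = edge_mask_mlp(Z, edges, params)
print(dict(zip(edges, np.round(mask, 3))))
```

Because the generator is a function of node embeddings rather than a per-graph parameter table, the same parameters can produce masks for edges of any input graph once trained.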

GraphMask (Schlichtkrull et al, 2021) also produces the explanation by estimating
the influences of edges. Similar to PGExplainer, GraphMask learns an erasure
function that quantifies the importance of each edge. The erasure function is defined
as:


$$z_{u,v}^{(k)} = g_{\pi}\big(h_{u}^{(k)}, h_{v}^{(k)}, m_{u,v}^{(k)}\big), \tag{7.17}$$


where $h_{u}^{(k)}$, $h_{v}^{(k)}$ and $m_{u,v}^{(k)}$ refer to the hidden embedding vectors for node $u$, node $v$ and
the message sent through the edge in graph convolution. $\pi$ denotes the parameters
of function $g$. One difference between GraphMask and PGExplainer is that the
former also takes the edge embedding as input. Another difference is that GraphMask
provides the importance estimation for every graph convolution layer, and $k$
indicates the layer that the embedding vectors belong to. Instead of directly erasing the
influences of unimportant edges, the authors then propose to replace the message
sent through unimportant edges as:


$$\tilde{m}_{u,v}^{(k)} = z_{u,v}^{(k)} \cdot m_{u,v}^{(k)} + \big(1 - z_{u,v}^{(k)}\big) \cdot b^{(k)}, \tag{7.18}$$


where $b^{(k)}$ is trainable. The work shows that a large proportion of edges can be
dropped without deteriorating the model performance.

Causal Screening (Wang et al, 2021) is a model-agnostic post-hoc method that
identifies a subgraph of the input as an explanation from the cause-effect standpoint.
Causal Screening uses the causal effect of a candidate subgraph as the metric:


$$S(\mathcal{G}_k) = MI\big(do(\mathcal{G} = \mathcal{G}_k); \hat{y}\big) - MI\big(do(\mathcal{G} = \emptyset); \hat{y}\big), \tag{7.19}$$


where $\mathcal{G}_k$ is the candidate subgraph, $k$ is the number of edges and MI is the mutual
information. The interventions $do(\mathcal{G} = \mathcal{G}_k)$ and $do(\mathcal{G} = \emptyset)$ mean that the model
input receives treatment (feeding $\mathcal{G}_k$ into the model) and control (feeding $\emptyset$ into the
model), respectively. $\hat{y}$ denotes the prediction when feeding the original graph into
the model. Causal Screening uses a greedy algorithm to search for the explanation.
Starting from an empty set, at each step, it adds the edge with the highest causal
effect into the candidate subgraph.
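
The greedy search can be sketched independently of how the causal effect is estimated. In the example below, the mutual-information terms are replaced by a user-supplied scoring function `score_fn`, which is a deliberate simplification; the additive toy scorer and the name `greedy_edge_screening` are hypothetical.

```python
def greedy_edge_screening(candidate_edges, score_fn, k):
    """Greedily grow an edge set, adding the edge with the largest effect each step.

    `score_fn(edges)` is a user-supplied proxy for the mutual-information term;
    the effect of a candidate set is measured against the empty graph.
    """
    control = score_fn(frozenset())
    selected = []
    remaining = list(candidate_edges)
    for _ in range(min(k, len(remaining))):
        effects = {e: score_fn(frozenset(selected + [e])) - control for e in remaining}
        best = max(remaining, key=lambda e: effects[e])
        selected.append(best)
        remaining.remove(best)
    return selected

# Hypothetical scorer: the toy "model" relies mostly on edges (0, 1) and (1, 2).
EDGE_WEIGHT = {(0, 1): 2.0, (1, 2): 1.0, (2, 3): 0.1, (3, 4): 0.0}

def toy_score(edges):
    return sum(EDGE_WEIGHT[e] for e in edges)

explanation = greedy_edge_screening(EDGE_WEIGHT.keys(), toy_score, k=2)
print(explanation)
```

With this scorer the search selects edge (0, 1) first and edge (1, 2) second. Each step requires one model evaluation per remaining edge, so the cost grows with both the explanation size and the number of candidate edges.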

CF-GNNExplainer (Lucic et al, 2021) proposes to generate counterfactual
explanations for GNNs. Different from previous methods that try to find a sparse
subgraph to preserve the correct prediction, CF-GNNExplainer proposes to find the
minimal number of edges to be removed such that the prediction changes. Similar to
GNNExplainer, CF-GNNExplainer employs a soft mask as well. Therefore, it also
suffers from the “introduced evidence” problem (Dabkowski and Gal, 2017), which
means that mask values other than zero or one may introduce unnecessary information or
noise, and thus influence the explanation result.


7.2.5 Generative Explanation 


Many methods introduced in previous subsections define the explanation as selecting
subgraphs that contain a subset of the nodes, edges or features of the original input.
Recently, XGNN (Yuan et al, 2020b) proposes to obtain an explanation by generating
a graph that maximizes the prediction of the given GNN model. Some methods that
share a similar idea have been proposed for computer vision tasks. For example, the
role of a neuron could be understood by finding the input prototypes that maximally
activate the neuron (Olah et al, 2018). The problem of finding prototype
samples can be defined as an optimization problem, which can be solved by
gradient ascent. However, this method cannot be directly used on GNNs because
the gradient ascent method is not compatible with the discrete and topological
nature of graph data. To tackle this problem, XGNN defines graph generation as a
reinforcement learning task.

To be more specific, the generator follows the steps below. First, it randomly
picks one node as the initial graph. Second, given an intermediate graph, the generator
adds a new edge to the graph. This action is carried out in two steps: choosing
the edge’s starting point and then its end point. XGNN employs another GNN as
the policy to determine the action. The GNN learns node features, and two MLPs
then take the learned features as input to predict the probability of each node being
the start point and the end point. The end point and the edge between the two points
are added to update the intermediate graph as an action. Finally, it calculates the
reward of the action, so that the generator can be trained via policy gradient algorithms.
The reward consists of two terms. The first term is the score of the intermediate graph
after feeding it to the target GNN model. The second one is a regularization term that
guarantees the validity of the intermediate graph. The above steps are executed
repeatedly until the number of action steps reaches the predefined upper limit. As a
generative explanation method, XGNN provides a holistic explanation for graph
classification. More generative explanation methods for other graph analysis tasks
remain to be explored in the future.


7.3 Interpretable Modeling on Graph Neural Networks 


Following the introduction in Section 7.1.3.2, we introduce two categories of in- 
terpretable modeling approaches, i.e., GNN models with attention mechanism and 
disentangled representation learning on graphs. 


7.3.1 GNN-Based Attention Models 


Attention mechanisms benefit model interpretability by highlighting relevant parts 
of the graph for the given task through attention scores. According to the graph 
types, we introduce attention models built upon homogeneous graphs and heteroge- 
neous graphs, respectively. 


7.3.1.1 Attention Models for Homogeneous Graphs 


Graph Attention Networks (GATs) enable assigning different weights to different
node embeddings in a neighborhood when aggregating information (Veličković
et al, 2018). Specifically, let $h_{i}^{l}$ denote the column-wise embedding of node $i$, then
the embedding update is written as:


$$h_{i}^{l+1} = \sigma\Big(\sum_{j \in \mathcal{V}_i \cup \{i\}} \alpha_{i,j} W h_{j}^{l}\Big), \tag{7.20}$$


where $\alpha_{i,j}$ is the attention score, and $\mathcal{V}_i$ denotes the set of neighbors of node $i$. Also,
GAT uses a shared parameter matrix $W$ independent of the layer depth. The attention
score is computed as:


$$\alpha_{i,j} = \mathrm{softmax}(e_{i,j}) = \frac{\exp(e_{i,j})}{\sum_{k \in \mathcal{V}_i \cup \{i\}} \exp(e_{i,k})}, \tag{7.21}$$

Fig. 7.5: Left: An illustration of graph convolution with single head attentions by
node 1 on its neighborhood. Middle: The linear transformation with a shared
parameter matrix. Right: The attention mechanism employed in (Veličković et al, 2018).


where the self-attention mechanism is applied,

$$e_{i,j} = \mathrm{LeakyReLU}\big(a^{\top} [W h_{i}^{l} \,\|\, W h_{j}^{l}]\big), \tag{7.22}$$


where $\|$ denotes vector concatenation. In general, the attention mechanism can also
be denoted as $e_{i,j} = \mathrm{attn}(h_{i}^{l}, h_{j}^{l})$. Therefore, the attention mechanism is a single-layer
neural network parameterized by a weight vector $a$. The attention score $\alpha_{i,j}$
shows the importance of node $j$ to node $i$.
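
The attention scores can be computed for all node pairs at once. The sketch below follows the single-head formulation with a LeakyReLU slope of 0.2 (an assumed value); the toy path graph, the random untrained parameters and the name `gat_attention` are illustrative assumptions.

```python
import numpy as np

def leaky_relu(z, slope=0.2):
    return np.where(z > 0, z, slope * z)

def gat_attention(H, A, W, a):
    """Return alpha[i, j], normalized over j in N(i) and i itself; zero elsewhere."""
    n = H.shape[0]
    WH = H @ W.T                                   # row i holds W h_i
    d_out = WH.shape[1]
    mask = (A + np.eye(n)) > 0                     # neighbors plus the node itself
    # a^T [W h_i || W h_j] splits into a part for node i and a part for node j.
    src = WH @ a[:d_out]
    dst = WH @ a[d_out:]
    e = leaky_relu(src[:, None] + dst[None, :])
    e = np.where(mask, e, -np.inf)                 # exclude non-neighbors from the softmax
    e = e - e.max(axis=1, keepdims=True)           # numerical stability
    w = np.exp(e)
    return w / w.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)          # toy path graph 0 - 1 - 2 - 3
H = rng.normal(size=(4, 2))                        # stand-in node embeddings (rows)
W = rng.normal(size=(3, 2))                        # shared linear transformation
a = rng.normal(size=6)                             # attention weight vector

alpha = gat_attention(H, A, W, a)
print(np.round(alpha, 3))
```

Each row of `alpha` sums to one and is zero outside the node's neighborhood, so row $i$ can be read directly as the relative importance of each neighbor $j$ to node $i$.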

The above mechanism could also be extended with multi-head attention.
Specifically, $K$ independent attention mechanisms are executed in parallel, and the results
are concatenated:

$$h_{i}^{l+1} = \big\Vert_{k=1}^{K}\, \sigma\Big(\sum_{j \in \mathcal{V}_i \cup \{i\}} \alpha_{i,j}^{k} W^{k} h_{j}^{l}\Big), \tag{7.23}$$


where $\alpha_{i,j}^{k}$ is the normalized attention score in the $k$-th attention mechanism, and $W^{k}$
is the corresponding parameter matrix.

Besides learning node embeddings, we could also apply attention mechanisms to 
learn a low-dimensional embedding for the whole graph (Ling et al, 2021). Suppose 
we are working on an information retrieval problem. Given a set of graphs $\{\mathcal{G}_m\}$,
$1 \le m \le M$, and a query $q$, we want to return the graphs that are most relevant to the
query. The embedding of each graph $\mathcal{G}_m$ with respect to $q$ could be computed using
the attention mechanism. In the first step, we could apply normal GNN propagation
rules as introduced in Equation 7.2, to obtain the embeddings of nodes inside each
graph. Let $q$ denote the embedding of the query, and $h^{i,m}$ denote the embedding of
node $i$ in a graph $\mathcal{G}_m$. The embedding of graph $\mathcal{G}_m$ with respect to the query can be
computed as:


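$$h^{\mathcal{G}_m}_q = \frac{1}{|\mathcal{G}_m|} \sum_{i=1}^{|\mathcal{G}_m|} \alpha_{i,q} h^{i,m}, \tag{7.24}$$

where $\alpha_{i,q} = \mathrm{attn}(h^{i,m}, q)$ is the attention score, and $\mathrm{attn}(\cdot)$ is a certain attention function.
Finally, $h^{\mathcal{G}_m}_q$ can be used to compute the similarity of $\mathcal{G}_m$ to the query in the
graph retrieval task.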
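
The query-conditioned graph embedding of Eq. 7.24 can be illustrated with a small sketch. Since the text leaves the attention function open, the example assumes that attn is a softmax over dot products between node embeddings and the query, and that graphs are ranked by cosine similarity; both choices, as well as the helper names and toy values, are assumptions made for illustration.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def graph_embedding(node_embs, q):
    """Eq. 7.24: query-conditioned embedding of one graph.

    Assumption: attn(h, q) is a softmax over the dot products h . q.
    """
    scores = [dot(h, q) for h in node_embs]
    top = max(scores)
    w = [math.exp(s - top) for s in scores]
    total = sum(w)
    alpha = [v / total for v in w]
    n = len(node_embs)
    return [sum(alpha[i] * node_embs[i][d] for i in range(n)) / n
            for d in range(len(q))]


def cosine(u, v):
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


def rank_graphs(graphs, q):
    """Return graph indices sorted from most to least similar to q."""
    sims = [cosine(graph_embedding(g, q), q) for g in graphs]
    return sorted(range(len(graphs)), key=lambda m: -sims[m])


q = [1.0, 0.0]
graphs = [
    [[0.0, 1.0], [0.1, 0.9]],   # nodes mostly orthogonal to q
    [[1.0, 0.0], [0.9, 0.1]],   # nodes aligned with q
]
ranking = rank_graphs(graphs, q)
```

In this toy setting the second graph is ranked first, because its node embeddings point in the direction of the query.
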
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7.3.1.2 Attention Models for Heterogeneous Graphs 


A heterogeneous network is a network with multiple types of nodes, links, and even 
attributes. The structural heterogeneity and rich semantic information bring chal- 
lenges for designing graph neural networks to fuse information. 


Definition 7.4. A heterogeneous graph is defined as $\mathcal{G} = (\mathcal{V}, \mathcal{E}, \phi, \psi)$, where $\mathcal{V}$ is
the set of node objects and $\mathcal{E}$ is the set of edges. Each node $v \in \mathcal{V}$ is associated with
a node type $\phi(v)$, and each edge $(i, j) \in \mathcal{E}$ is associated with an edge type $\psi((i, j))$.


We introduce how the challenge in embedding could be tackled using Heteroge- 
neous graph Attention Network (HAN) (Wang et al, 2019m). Different from tradi- 
tional GNNs, information propagation on HAN is conducted based on meta-paths. 


Definition 7.5. A meta-path $\Phi$ is defined as a path of the form $v_1 \xrightarrow{r_1} v_2 \xrightarrow{r_2} \cdots \xrightarrow{r_{l-1}} v_l$,
abbreviated as $v_1 v_2 \cdots v_l$, with a composite relation $r_1 \circ r_2 \circ \cdots \circ r_{l-1}$.

To learn the embedding of node $i$, we propagate the embeddings from its neighbors
within the meta-path. The set of neighbor nodes is denoted as $\mathcal{V}^{\Phi}_i$. Considering
that different types of nodes have different feature spaces, a node embedding is first
projected to the same space: ${h'}^{j} = M_{\phi_j} h^{j}$. Here $M_{\phi_j}$ is the transformation matrix for
node type $\phi_j$. The attention mechanism in HAN is similar to GAT, except that we
need to consider the type of meta-path that is currently sampled. Specifically,


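$$z^{i,\Phi} = \sigma\Big(\sum_{j \in \mathcal{V}^{\Phi}_i} \alpha^{\Phi}_{i,j} {h'}^{j}\Big), \tag{7.25}$$

where the normalized attention score is

$$\alpha^{\Phi}_{i,j} = \mathrm{softmax}(e^{\Phi}_{i,j}) = \mathrm{softmax}\big(\mathrm{attn}({h'}^{i}, {h'}^{j}; \Phi)\big). \tag{7.26}$$

Given a set of meta-paths $\{\Phi_1, \ldots, \Phi_P\}$, we can obtain a group of node embeddings
denoted as $\{z^{i,\Phi_1}, \ldots, z^{i,\Phi_P}\}$. To fuse embeddings across different meta-paths, another
attention mechanism is applied. The fused embedding is computed as:

$$z^{i} = \sum_{p=1}^{P} \beta_{\Phi_p} z^{i,\Phi_p}, \tag{7.27}$$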


where the normalized attention score is

$$\beta_{\Phi_p} = \mathrm{softmax}(w_{\Phi_p}) = \mathrm{softmax}\Big(\frac{1}{|\mathcal{V}|} \sum_{i \in \mathcal{V}} q^\top \cdot \mathrm{MLP}(z^{i,\Phi_p})\Big), \tag{7.28}$$

where $q$ is a learnable semantic vector and $\mathrm{MLP}(\cdot)$ denotes a one-layer MLP module.
$w_{\Phi_p}$ can be interpreted as the importance of the meta-path $\Phi_p$. Besides modeling
heterogeneous types of nodes and edges, HetGNN (Zhang et al, 2019b) extends
the discussion by considering heterogeneity in node attributes (e.g., images, texts,
categorical features).

Fig. 7.6: Using multiple embeddings to represent the interests of a user. Each embedding
segment corresponds to one aspect in data (Liu et al, 2019a).

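Returning to the meta-path fusion step of HAN (Eqs. 7.27 and 7.28), the following sketch shows how the meta-path weights and the fused node embeddings could be computed. The tanh nonlinearity inside the one-layer MLP, the list-based data layout, and all numeric values are assumptions chosen only to keep the example small.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def one_layer_mlp(z, W, b):
    """tanh(W z + b); the tanh nonlinearity is an assumption."""
    return [math.tanh(dot(row, z) + bias) for row, bias in zip(W, b)]


def fuse_meta_paths(z, q, W, b):
    """Eqs. 7.27-7.28.

    z: list over meta-paths; z[p][i] is the embedding of node i under
       meta-path p (every node appears under every meta-path).
    Returns (beta, fused) with fused[i] = sum_p beta[p] * z[p][i].
    """
    num_nodes = len(z[0])
    # Eq. 7.28: average the per-node scores q^T MLP(z) for each meta-path
    w = [sum(dot(q, one_layer_mlp(zi, W, b)) for zi in zp) / num_nodes
         for zp in z]
    top = max(w)
    e = [math.exp(v - top) for v in w]
    total = sum(e)
    beta = [v / total for v in e]
    # Eq. 7.27: weighted sum of the meta-path-specific embeddings
    dim = len(z[0][0])
    fused = [[sum(beta[p] * z[p][i][d] for p in range(len(z)))
              for d in range(dim)] for i in range(num_nodes)]
    return beta, fused


z = [
    [[1.0, 0.0], [0.8, 0.2]],   # node embeddings under meta-path 1
    [[0.0, 1.0], [0.3, 0.7]],   # node embeddings under meta-path 2
]
q = [1.0, -1.0]                 # hypothetical semantic vector
W = [[1.0, 0.0], [0.0, 1.0]]    # hypothetical MLP weights
b = [0.0, 0.0]
beta, fused = fuse_meta_paths(z, q, W, b)
```

Because `beta` is shared by all nodes, it can be read as a global importance score per meta-path, which is the interpretability benefit described above.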

7.3.2 Disentangled Representation Learning on Graphs 


Traditional representation learning is limited in interpretability due to the opacity 
of the representation space. Different from manual feature engineering where the 
meaning of each resultant feature dimension is specified, the meaning of each di- 
mension of the representation space is unknown. Representation learning on graphs 
also suffers from this limitation. To tackle this issue, several approaches have been 
proposed to enable assigning concrete meanings to different representation dimen- 
sions, thus improving the interpretability of representation learning on graphs. 


7.3.2.1 Is A Single Vector Enough? 


Many existing representation learning methods on graphs focus on learning a sin- 
gle embedding for each node. However, for those scenarios where some nodes have 
multiple facets, is a single vector enough to represent each node? Solving such a 
problem is of great practical value for applications such as recommender systems, 
where users could have multiple interests. In this case, we could use multiple em- 
beddings to represent each user, and each embedding corresponds to one interest. 
An example is shown in Fig. 7.6. Specifically, if $h^i \in \mathbb{R}^D$ denotes the embedding
of node $i$, then $h^i = [h^{i,1}; h^{i,2}; \ldots; h^{i,K}]$, where $h^{i,k} \in \mathbb{R}^{D/K}$ is the embedding for the
$k$-th facet. There are two challenges in learning disentangled representations, i.e.,
how to discover the $K$ facets, and how to distinguish the update of different embeddings
during the training process. The facets could be discovered in an unsupervised
manner by using clustering, where each cluster represents a facet. In the following
parts, we introduce several approaches for learning disentangled node embeddings
on graphs.

Fig. 7.7: The high-level idea of learning the disentangled node embedding for a
target node by using clustering or dynamic routing.


7.3.2.2 Prototypes-Based Soft-Cluster Assignment 


We discuss the techniques in the context of recommender system design. Facets 
that represent item types are discovered as we learn user and item embeddings. 
Here we assume that each item only has one facet, while each user could still have 
multiple facets. The embedding of item $t$ is simply $h^t$, while the embedding of
user $u$ is $h^u = [h^{u,1}; h^{u,2}; \ldots; h^{u,K}]$. Each item $t$ is associated with a one-hot vector
$c_t = [c_{t,1}, c_{t,2}, \ldots, c_{t,K}]$, where $c_{t,k} = 1$ if $t$ belongs to facet $k$, and $c_{t,k} = 0$ otherwise.
Besides node embeddings, we also need to learn a set of prototype embeddings
$\{m^k\}_{k=1}^{K}$. The one-hot vector is drawn from the categorical distribution below:

$$c_t \sim \mathrm{categorical}\big(\mathrm{softmax}([s_{t,1}, s_{t,2}, \ldots, s_{t,K}])\big), \quad s_{t,k} = \cos(h^t, m^k)/\tau, \tag{7.29}$$

where $\tau$ is a hyper-parameter that scales the cosine similarity. Then, the probability
of observing an edge $(u,t)$ is

$$p(t \mid u, c_t) \propto \sum_{k=1}^{K} c_{t,k} \cdot \mathrm{similarity}(h^t, h^{u,k}). \tag{7.30}$$


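The soft-cluster assignment in Eq. 7.29 and the edge score in Eq. 7.30 can be sketched as follows. The text does not fix the similarity function, so the example assumes similarity(a, b) = exp(cos(a, b) / tau); the function names and toy embeddings are likewise hypothetical.

```python
import math
import random


def cosine(u, v):
    num = sum(a * b for a, b in zip(u, v))
    return num / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))


def facet_probs(h_t, prototypes, tau=0.1):
    """softmax([s_t1, ..., s_tK]) with s_tk = cos(h^t, m^k) / tau (Eq. 7.29)."""
    s = [cosine(h_t, m) / tau for m in prototypes]
    top = max(s)
    e = [math.exp(v - top) for v in s]
    total = sum(e)
    return [v / total for v in e]


def sample_facet(probs, rng):
    """Draw the one-hot vector c_t from the categorical distribution."""
    k = rng.choices(range(len(probs)), weights=probs)[0]
    return [1 if j == k else 0 for j in range(len(probs))]


def edge_score(h_t, user_facets, c_t, tau=0.1):
    """Unnormalized p(t | u, c_t) from Eq. 7.30.

    Assumption: similarity(a, b) = exp(cos(a, b) / tau).
    """
    return sum(c * math.exp(cosine(h_t, h_uk) / tau)
               for c, h_uk in zip(c_t, user_facets))


prototypes = [[1.0, 0.0], [0.0, 1.0]]      # m^1, m^2
h_t = [0.9, 0.1]                           # item close to prototype 1
user_facets = [[1.0, 0.2], [0.1, 1.0]]     # h^{u,1}, h^{u,2}
probs = facet_probs(h_t, prototypes)
c_t = sample_facet(probs, random.Random(0))
score = edge_score(h_t, user_facets, c_t)
```

Only the user facet selected by `c_t` contributes to the score, so gradient updates for an observed edge would touch a single facet of the user embedding.
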
Besides the fundamental learning process introduced above, the variational autoencoder
framework could also be applied to regularize the learning process (Ma et al,
2019c). The item embeddings and prototype embeddings are jointly updated until
convergence. The embedding of each user $h^u$ is determined by aggregating the embeddings
of interacted items, where $h^{u,k}$ collects embeddings from items that also
belong to facet $k$. In the learning process, the cluster discovery, node-cluster assignments,
and embedding learning are jointly conducted.


7.3.2.3 Dynamic Routing Based Clustering


The idea of using dynamic routing for disentangled node representation learning is 
motivated by the Capsule Network (Sabour et al, 2017). There are two layers of 
capsules, i.e., low-level capsules and high-level capsules. Given a user $u$, the set of
items that the user has interacted with is denoted as $\mathcal{V}_u$. The set of low-level capsules
is $\{c^l_i\}$, $i \in \mathcal{V}_u$, so each capsule is the embedding of an interacted item. The set of
high-level capsules is $\{c^h_k\}$, $1 \le k \le K$, where $c^h_k$ represents the user’s $k$-th interest.
The routing logit value $b_{i,k}$ between low-level capsule $i$ and high-level capsule $k$
is computed as:

$$b_{i,k} = (c^h_k)^\top S c^l_i, \tag{7.31}$$


where S is the bilinear mapping matrix. Then, the intermediate embedding for high- 
level capsule k is computed as a weighted sum of low-level capsules, 


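$$z^h_k = \sum_{i \in \mathcal{V}_u} w_{i,k} S c^l_i, \qquad w_{i,k} = \frac{\exp(b_{i,k})}{\sum_{k'=1}^{K} \exp(b_{i,k'})}, \tag{7.32}$$

so $w_{i,k}$ can be seen as the attention weights connecting the two capsules. Finally, a
“squash” function is applied to obtain the embedding of high-level capsules:

$$c^h_k = \mathrm{squash}(z^h_k) = \frac{\|z^h_k\|^2}{1 + \|z^h_k\|^2} \frac{z^h_k}{\|z^h_k\|}. \tag{7.33}$$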


The above steps constitute one iteration of dynamic routing. The routing process is 
usually repeated for several iterations to converge. When the routing finishes, the 
high-level capsules can be used to represent the user u with multiple interests, to be 
fed into subsequent network modules for inference (Li et al, 2019b), as shown in 
Fig. 7.7. 

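The routing iteration in Eqs. 7.31 to 7.33 can be sketched as below. The text does not say how the high-level capsules are initialized, so the sketch assumes small random vectors; the function names, the number of iterations, and the toy item embeddings are also illustrative assumptions rather than the setup of the cited work.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def matvec(S, x):
    return [dot(row, x) for row in S]


def squash(z):
    """Eq. 7.33: rescale z so that its norm lies in [0, 1)."""
    n2 = dot(z, z)
    if n2 == 0.0:
        return list(z)
    scale = n2 / (1.0 + n2) / math.sqrt(n2)
    return [scale * v for v in z]


def dynamic_routing(items, S, K, iters=3, seed=0):
    """Route low-level item capsules to K high-level interest capsules.

    Assumption: high-level capsules start from small random vectors.
    Returns (high-level capsules, routing weights w[i][k]).
    """
    rng = random.Random(seed)
    proj = [matvec(S, c) for c in items]             # S c_i^l
    dim = len(proj[0])
    high = [[rng.gauss(0.0, 0.1) for _ in range(dim)] for _ in range(K)]
    w = []
    for _ in range(iters):
        w = []
        for p in proj:
            b = [dot(high[k], p) for k in range(K)]  # Eq. 7.31
            top = max(b)
            e = [math.exp(v - top) for v in b]
            total = sum(e)
            w.append([v / total for v in e])         # weights in Eq. 7.32
        z = [[sum(w[i][k] * proj[i][d] for i in range(len(proj)))
              for d in range(dim)] for k in range(K)]
        high = [squash(zk) for zk in z]              # Eq. 7.33
    return high, w


items = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]  # interacted items
S = [[1.0, 0.0], [0.0, 1.0]]                              # bilinear map
high, w = dynamic_routing(items, S, K=2)
```

Each row of `w` sums to one, so it can be read as a soft assignment of an item to the user's interests.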

7.4 Evaluation of Graph Neural Networks Explanations 


In this section, we introduce the setting for evaluating GNN explanations. This in- 
cludes the datasets that are commonly used for constructing and explaining GNNs, 
as well as the metrics that evaluate different aspects of explanations. 


7.4.1 Benchmark Datasets 


As more approaches have been proposed for explaining GNNs, a variety of datasets
have been used to assess their effectiveness. Because this research direction is still
at an early stage of development, a universally accepted benchmark dataset, such
as the COCO dataset for image object detection, has not yet been established. Here
we list a number of datasets that have been used for developing GNN explanation
approaches, including synthetic datasets and real-world datasets.


7.4.1.1 Synthetic Datasets 


It is difficult to evaluate explanations because datasets usually contain no ground truth
to compare with. A strategy to mitigate this problem is to use synthetic datasets.
In this case, motifs designed by humans can be added to the data to play the role of
ground truth, and these motifs are assumed to be relevant to the learning task. Some
synthetic graph datasets are listed below.


• BA-Shapes (Ying et al, 2019): A Barabási-Albert graph with 300 nodes, to
which 80 house-shaped motifs are attached randomly. It is then further augmented
by adding 10% random edges.

• BA-Community (Ying et al, 2019): A graph consisting of two BA-Shapes graphs, with
node features in the two BA-Shapes graphs following different normal distributions
to distinguish them.

• Tree-Cycle (Ying et al, 2019): A graph based on an eight-level balanced tree, to
which 80 hexagonal motifs are randomly attached.

• Tree-Grid (Ying et al, 2019): A graph similar to Tree-Cycle, but with 80 3-by-3
grid motifs instead of the hexagonal motifs.

• Noisy BA-Community, Noisy Tree-Cycle, Noisy Tree-Grid (Lin et al, 2020a):
These datasets are obtained by adding 40 important and 10 unimportant
node features to the corresponding datasets listed above. This design helps to
test a method’s ability to identify important node features.

• BA-2Motifs (Luo et al, 2020): A dataset containing 800 independent graphs, each
obtained by adding either a pentagon motif or a house motif to a base BA
graph. This dataset is designed for graph classification, while the previous ones
are for node classification.

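To make the construction of such synthetic benchmarks concrete, here is a rough sketch of a BA-Shapes-like generator in plain Python. It is not the original generator: the preferential-attachment routine is a simplified variant, the attachment parameter `m=5` is an assumption, and "10% random edges" is interpreted here as adding a number of edges equal to 10% of the node count. The returned `in_motif` flags play the role of the ground-truth explanation.

```python
import random


def barabasi_albert(n, m, rng):
    """Simplified preferential attachment: each new node links to m nodes."""
    edges = set()
    targets = list(range(m))
    repeated = []                      # each node appears once per incident edge
    for new in range(m, n):
        edges.update((t, new) for t in targets)
        repeated.extend(targets)
        repeated.extend([new] * m)
        chosen = set()
        while len(chosen) < m:         # m distinct targets, degree-proportional
            chosen.add(rng.choice(repeated))
        targets = sorted(chosen)
    return edges


def ba_shapes(base_nodes=300, motifs=80, m=5, seed=0):
    """Return (num_nodes, edges, in_motif) for a BA-Shapes-like graph."""
    rng = random.Random(seed)
    edges = barabasi_albert(base_nodes, m, rng)
    in_motif = [False] * base_nodes
    n = base_nodes
    for _ in range(motifs):
        a, b, c, d, roof = range(n, n + 5)
        # house = square a-b-c-d plus a roof node joined to a and b
        edges.update([(a, b), (b, c), (c, d), (a, d), (a, roof), (b, roof)])
        edges.add((rng.randrange(base_nodes), a))   # attach to the base graph
        in_motif.extend([True] * 5)
        n += 5
    extra = n // 10                    # assumed meaning of "10% random edges"
    while extra > 0:
        u, v = rng.randrange(n), rng.randrange(n)
        e = (min(u, v), max(u, v))
        if u != v and e not in edges:
            edges.add(e)
            extra -= 1
    return n, edges, in_motif


num_nodes, edges, in_motif = ba_shapes()
```

An explanation method can then be scored by checking how many of the nodes or edges it highlights fall inside the planted motifs.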

7.4.1.2 Real-World Datasets 


Some examples of real-world graph datasets are listed below.


• MUTAG (Debnath et al, 1991): A dataset consisting of 4,337 molecule graphs
that are labeled mutagenic or non-mutagenic. The nodes and edges in a graph
represent the atoms and chemical bonds. Related studies have shown that
molecules with carbon rings and nitro groups (NO$_2$) may lead to mutagenic
effects. There are also several other molecule datasets, such as BBBP, BACE
and TOX21 (Pope et al, 2019).

• REDDIT-BINARY (Yanardag and Vishwanathan, 2015): An online-discussion
interaction dataset. It contains 2,000 graphs, each labeled as a
question-answer based or a discussion based community. The nodes and edges
represent the users and their interactions, respectively.

• Delaney Solubility (Delaney, 2004): A molecule dataset with 1,127 molecule
graphs, whose labels are the water-octanol partition coefficient. This dataset
is usually used for graph regression tasks.

• Bitcoin-Alpha, Bitcoin-OTC (Kumar et al, 2016): Two trust-weighted signed
networks. Each consists of a graph whose nodes are accounts trading on
the Bitcoin-Alpha or Bitcoin-OTC platform. The nodes are labeled trustworthy
or not according to other members’ ratings.

• MNIST SuperPixel-Graph (Dwivedi et al, 2020): An image dataset in the
form of graphs. Each sample is a graph converted from the corresponding image
in the MNIST dataset. Every node is a super-pixel that represents the intensity
of the corresponding region.


7.4.2 Evaluation Metrics 


An appropriate evaluation metric is crucial for comparing methods. Explanation
visualization such as heat-maps, due to its intuitiveness, has been widely used for
explaining image and text data. However, this advantage is lost on graphs, since graph
data is not intuitive to understand and only experts with domain knowledge can
judge an explanation. In this section, we introduce several commonly used metrics.


• Accuracy is only appropriate for datasets with ground truth. Synthetic
datasets usually contain ground truth defined by the rules used to construct
them. For example, in molecule datasets, a molecule with an NO$_2$ group and a carbon
ring is mutagenic. Considering that carbon rings also occur in non-mutagenic
molecules, the NO$_2$ group is considered the ground truth. F1 score and ROC-AUC
are commonly used accuracy metrics. The limitation of accuracy metrics is
that it is unknown whether the GNN model makes predictions in the same way
as humans (i.e., whether the pre-defined ground truth is really valid).

• Fidelity (Pope et al, 2019) follows the intuition that removing the truly important
features will significantly decrease the model performance. Formally,
fidelity is defined as:

$$\mathrm{fidelity} = \frac{1}{N} \sum_{i=1}^{N} \big( f(\mathcal{G}_i) - f(\mathcal{G}_i \setminus \mathcal{G}'_i) \big), \tag{7.34}$$

where $f$ is the output function of the target model, $N$ is the number of graphs, $\mathcal{G}_i$ is the $i$-th graph,
$\mathcal{G}'_i$ is the explanation for it, and $\mathcal{G}_i \setminus \mathcal{G}'_i$ represents the perturbed $i$-th graph in which the
identified explanation is removed.

• Contrastivity (Pope et al, 2019) uses Hamming distance to measure the differences
between two explanations. These two explanations correspond to the
model’s prediction of one instance for different classes. It is assumed that models
would highlight different features when making predictions for different
classes. The higher the contrastivity, the better the performance of the interpreter.

• Sparsity (Pope et al, 2019) is calculated as the ratio of explanation graph size
to input graph size. In some cases, explanations are encouraged to be sparse,
because a good explanation should include only the essential features as far as
possible and discard the irrelevant ones.

• Stability (Sanchez-Lengeling et al, 2020) measures the performance gap of the
interpreter before and after adding noise to the explanation. It suggests that a
good explanation should be robust to slight changes in the input that do not
affect the model’s prediction.

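Several of the metrics above reduce to a few lines of code once graphs and explanations are represented as edge sets or binary masks. The sketch below is illustrative only: `toy_model` is a hypothetical stand-in for a trained GNN, explanation size is counted in edges, and normalizing the Hamming distance by the mask length is an assumption.

```python
def fidelity(f, graphs, explanations):
    """Eq. 7.34: mean drop in model output after removing each explanation.

    Graphs and explanations are sets of edges; f maps an edge set to the
    model's output (e.g., the predicted probability of the true class).
    """
    drops = [f(g) - f(g - expl) for g, expl in zip(graphs, explanations)]
    return sum(drops) / len(drops)


def sparsity(graph, explanation):
    """Ratio of explanation size to input graph size (both counted in edges)."""
    return len(explanation) / len(graph)


def contrastivity(mask_a, mask_b):
    """Hamming distance between two binary explanation masks.

    Assumption: the distance is normalized by the mask length.
    """
    return sum(x != y for x, y in zip(mask_a, mask_b)) / len(mask_a)


MOTIF = {(0, 1), (1, 2), (0, 2)}   # hypothetical ground-truth motif (a triangle)


def toy_model(edge_set):
    """Hypothetical stand-in for a GNN: confident only if the motif is intact."""
    return 0.9 if MOTIF <= edge_set else 0.1


graph = MOTIF | {(2, 3), (3, 4), (4, 5)}
good = {(0, 1), (1, 2), (0, 2)}    # explanation that recovers the motif
bad = {(2, 3), (3, 4), (4, 5)}     # explanation made of irrelevant edges

fid_good = fidelity(toy_model, [graph], [good])
fid_bad = fidelity(toy_model, [graph], [bad])
```

Removing the edges of the good explanation lowers the model output, while removing the irrelevant edges leaves it unchanged, which is the behavior fidelity is meant to reward.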

7.5 Future Directions 


Interpretation on graph neural networks is an emerging domain. There are still many 
challenges to be tackled. In this section, we list several future directions towards 
improving the interpretability of graph neural networks. 

First, some online applications require real-time responses from models and algorithms.
This places high demands on the efficiency of explanation
methods. However, many GNN explanation methods rely on sampling or highly
iterative algorithms to obtain their results, which is time-consuming. Therefore, one
future research direction is how to develop more efficient explanation algorithms
without significantly sacrificing explanation precision.

Second, although more and more methods have been developed for interpreting 
GNN models, how to utilize interpretation towards identifying GNN model defects 
and improving model properties is still rarely discussed in existing work. Will GNN 
models be largely affected by adversarial attacks or backdoor attacks? Can interpre- 
tation help us to tackle these issues? How to improve GNN models if they have been 
found to be biased or untrustworthy? 

Third, besides attention methods and disentangled representation learning, are 
there other modeling or training paradigms that could also improve GNN inter- 
pretability? In the interpretable machine learning domain, some researchers are in- 
terested in providing causal relations between variables, while some others prefer 
using logic rules for reasoning. Therefore, how to bring causality into GNN learning,
or how to incorporate logical reasoning into GNN inference, may be an interesting
direction to explore.

Fourth, most existing efforts on interpretable machine learning have been devoted
to obtaining more accurate interpretations, while the human experience aspect is usually
overlooked. For end-users, friendly interpretations can improve the user experience
and gain their trust in the system. For domain experts without a machine learning
background, an intuitive interface helps integrate them into the system improvement
loop. Therefore, another possible direction is how to incorporate human-computer
interaction (HCI) to show explanations in a more user-friendly format, or how to design
better human-computer interfaces to facilitate user interactions with the model.

Editor’s Notes: Similar to the general trend in the machine learning domain,
explainability has been ever more widely recognized as an important
metric for graph neural networks, in addition to well-recognized ones
such as effectiveness (Chapter 4), complexity (Chapter 5), efficiency
(Chapter 6), and robustness (Chapter 8). Explainability can not only broadly
influence technique development (e.g., Chapters 9-18) by informing model
developers of useful model details, but can also benefit domain experts in
various application domains (e.g., Chapters 19-27) by providing them with
explanations of predictions.

Abstract Graph neural networks have achieved impressive results in various graph
learning tasks and they have found their way into many applications such as molec- 
ular property prediction, cancer classification, fraud detection, or knowledge graph 
reasoning. With the increasing number of GNN models deployed in scientific ap- 
plications, safety-critical environments, or decision-making contexts involving hu- 
mans, it is crucial to ensure their reliability. In this chapter, we provide an overview 
of the current research on adversarial robustness of GNNs. We introduce the unique 
challenges and opportunities that come along with the graph setting and give an 
overview of works showing the limitations of classic GNNs via adversarial example 
generation. Building upon these insights we introduce and categorize methods that 
provide provable robustness guarantees for graph neural networks as well as prin- 
ciples for improving robustness of GNNs. We conclude with a discussion of proper 
evaluation practices taking robustness into account. 


8.1 Motivation 


The success story of graph neural networks is astonishing. Within a few years, they 
have become a core component of many deep learning applications. Nowadays they 
are used in scientific applications such as drug design or medical diagnoses, are 
integrated in human-centered applications like fake news detection in social media, 
are applied in decision-making tasks, and are even studied in safety-critical environments
like autonomous driving. What unites these domains is their crucial need for
reliable results; misleading predictions are not only unfortunate but might indeed
lead to dramatic consequences, from false conclusions drawn in science to harm
for people. However, can we really trust the predictions resulting from graph neural
networks? What happens when the underlying data is corrupted or even becomes
deliberately manipulated?

Indeed, the vulnerability of classic machine learning models to (deliberate) per- 
turbations of the data is well known (Goodfellow et al, 2015): even only slight 
changes of the input can lead to wrong predictions. Such instances, for humans 
nearly indistinguishable from the original input yet wrongly classified, are also 
known as adversarial examples. One of the most well-known and alarming exam- 
ples is an image of a stop sign, which is classified as a speed limit sign by a neural 
network with only very subtle changes to the input; though, for us as humans it still 
clearly looks like a stop sign (Eykholt et al, 2018). Examples like these illustrate 
how machine learning models can dramatically fail in the presence of adversarial 
perturbations. Consequently, adopting machine learning for safety-critical or sci- 
entific application domains is still problematic. To address this shortcoming, many 
researchers have started to analyze the robustness of models in domains like images, 
natural language, or speech. Only recently, however, have GNNs come into focus.
Here, the first work studying GNNs’ robustness (Zügner et al, 2018) investigated
one of the most prominent tasks, node-level classification, and demonstrated the
susceptibility of GNNs to adversarial perturbations as well (see Figure 8.1). Since
then, the field of adversarial robustness on graphs has been rapidly expanding, with 
many works studying diverse tasks and models, and exploring ways to make GNNs 
more robust. 


Fig. 8.1 The upper left graph is the original input. On the right is the graph after
performing a small change (e.g. adding an edge or changing some node attributes). The
lower part illustrates the predicted classes for each node obtained from a GNN. Is it
possible to change the predictions? Are GNNs robust? (Figure labels: target node;
node classification via a graph neural network.)


To some degree it is surprising that graphs were not in the focus even earlier. 
Corrupted data and adversaries are common in many domains where graphs are 
analyzed, e.g., social media and e-commerce systems. Take for example a GNN- 
based model for detecting fake news in a social network (Monti et al, 2019; Shu et al, 
2020). Adversaries have a strong incentive to fool the system in order to avoid being 
detected. Similarly, in credit scoring systems, fraudsters try to disguise themselves 
by creating fake connections. Thus, robustness is an important concern for graph- 
based learning. 

It is important to highlight, though, that adversarial robustness is not only a topic
in light of security concerns, where intentional changes, potentially crafted by humans,
are used to try to fool the predictions. Instead, adversarial robustness considers
worst-case scenarios in general. Especially in safety-critical or scientific applications
where reliability is key, understanding the robustness of GNNs to worst-case
noise is important, as nature itself might be the adversary. The construction of
gene interaction networks, for example, often leads to corrupted graphs containing 
spurious edges (Tian et al, 2017). Thus, to make sure that graph neural networks 
work reliably in all these scenarios, we need to investigate robustness under worst- 
case/adversarial corruptions of the data. 

Moreover, non-robustness of GNNs shows conceptual gaps: while neural net- 
works are hypothesized to learn meaningful representations that capture the seman- 
tics of the domain and task, a non-robust model clearly violates this property. Since 
the small changes leading to an adversarial example do not alter the meaning, a rea- 
sonable representation should also not change the prediction. Thus, understanding 
adversarial robustness means understanding generalization performance. 


Unique Challenges in the Graph Domain 


In contrast to other application domains of deep learning, robustness analysis for 
graphs is especially challenging for multiple reasons: 


1. Complex perturbation space: Changes can manifest in various ways including 
perturbations in the graph structure and the node attributes, leading to a vast 
space to explore. Importantly, unlike other fields this often means operating in a 
discrete data domain such as adding or removing edges, leading to hard discrete 
optimization problems as we will see later. 

2. Interdependent data: The core feature of GNNs is to exploit the interdependence
between instances, for example, in the form of message passing or graph convolution.
Perturbations to the graph structure change the message passing scheme,
modifying how learned representations are propagated. Specifically, changes to
one part of the graph, e.g. one node, might affect many other instances.

3. Notion of similarity: We expect GNN models to be robust to small changes in 
the graph. If the graphs are almost indistinguishable, the predictions should be 
the same. However, defining the notion of similarity between graphs itself is 
a hard problem and unlike, e.g., images, manual inspection by a human is not 
practical. 


Given these challenges, in Section 8.2 we first introduce the principle
of adversarial attacks on GNNs and highlight some non-robustness results.
In Section 8.3, we give an overview of robustness certificates, providing ways of
proving the reliability of predictions, followed by Section 8.4 where approaches
for improving GNNs’ robustness are introduced. We conclude in Section 8.5 by
discussing aspects of proper evaluation.


152 Stephan Günnemann


8.2 Limitations of Graph Neural Networks: Adversarial 
Examples 


To understand the (non-)robustness of GNNs, we can try to construct worst-case 
perturbations — finding a small change of the data, which in consequence leads to a 
strong change in the GNN’s output. This is also known as performing an adversarial 
attack and the resulting perturbed data is often called an adversarial example.¹ While
random perturbations of the data often have minor effect, specific perturbations, in 
contrast, can be dramatic. Accordingly, an attack is often phrased as an optimization 
problem with the goal to find a perturbation of the data which maximizes some 
attack objective (e.g., maximize the predicted probability of some incorrect class). 


8.2.1 Categorization of Adversarial Attacks 


Before providing a general definition of adversarial attacks, it is helpful to distin- 
guish two very different notions, called poisoning vs. evasion scenarios. The dif- 
ference lies in the stage of the learning process in which the data perturbation is 
performed. In a poisoning scenario, the perturbation is injected before the training 
of the model; the perturbed data, thus, also affects the learning and the final model 
we obtain. In contrast, an evasion scenario assumes the model to be given, i.e., al- 
ready trained and fixed, and the perturbation is applied to future data during the 
application/test phase of the GNN. It is worth highlighting that for the frequently
considered transductive learning setting of GNNs, where we have no future test
data but only the given (un)labeled data, a poisoning scenario is the more natural
choice. In principle, though, any combination of learning setting (transductive vs. inductive)
and attack scenario (poisoning vs. evasion) is worth studying.

Given this basic distinction, performing a poisoning adversarial attack can be 
generally formulated as a bi-level optimization problem 


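$$\max_{\hat{\mathcal{G}} \in \Phi(\mathcal{G})} \mathcal{O}_{atk}\big(f_{\theta^*}(\hat{\mathcal{G}})\big) \quad \text{s.t.} \quad \theta^* = \arg\min_{\theta} \mathcal{L}_{train}\big(f_{\theta}(\hat{\mathcal{G}})\big). \tag{8.1}$$
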
Here $\Phi(\mathcal{G})$ denotes the set of all graphs we are treating as indistinguishable from the
given graph $\mathcal{G}$ at hand, and $\hat{\mathcal{G}}$ denotes a specific perturbed graph from this set. For
example, $\Phi(\mathcal{G})$ could capture all graphs which differ from $\mathcal{G}$ in at most ten edges
or in a few node attributes. The attacker’s goal is to find a graph $\hat{\mathcal{G}}$ that, when
passed through the GNN $f_{\theta^*}$, maximizes a specific objective $\mathcal{O}_{atk}$, e.g., increasing
the predicted probability of a certain class for a specific node. Importantly, in a
poisoning setting, the weights $\theta^*$ of the GNN are not fixed but learned based on
the perturbed data, leading to the inner optimization problem that corresponds to
the usual training procedure on the (now perturbed) graph. That is, $\theta^*$ is obtained
by minimizing some training loss $\mathcal{L}_{train}$ on the graph $\hat{\mathcal{G}}$. This nested optimization
makes the problem particularly hard.


¹ Again it is worth highlighting that such ‘attacks’ are not always due to human adversaries. Thus,
the terms ‘change’ or ‘perturbation’ might be better suited and have a more neutral connotation.

To define an evasion attack, the above equation can simply be changed by assuming
the parameter $\theta^*$ to be fixed. Often it is assumed to be given by minimizing the
training loss w.r.t. the given graph $\mathcal{G}$ (i.e. $\theta^* = \arg\min_{\theta} \mathcal{L}_{train}(f_{\theta}(\mathcal{G}))$). This makes
the above scenario a single-level optimization problem.

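The difference between the bi-level (poisoning) and single-level (evasion) problems can be made concrete with a deliberately tiny example. The sketch below is not a practical attack: it assumes a toy "GNN" that averages a scalar feature over each node's neighborhood and classifies by nearest class mean (so the inner training problem has a closed-form solution), takes the perturbation set to be all graphs within one edge flip, and solves the outer problem by exhaustive search. All names and data are hypothetical.

```python
from itertools import combinations


def aggregate(x, edges, n):
    """Mean of each node's own feature and its neighbors' features."""
    nbrs = {i: [i] for i in range(n)}
    for u, v in edges:
        nbrs[u].append(v)
        nbrs[v].append(u)
    return [sum(x[j] for j in nbrs[i]) / len(nbrs[i]) for i in range(n)]


def train(x, edges, n, labels):
    """Inner problem: theta* = class means of the aggregated features,
    which minimize the squared-error training loss on the labeled nodes."""
    agg = aggregate(x, edges, n)
    theta = {}
    for c in (0, 1):
        vals = [agg[i] for i, y in labels.items() if y == c]
        theta[c] = sum(vals) / len(vals)
    return theta


def attack_objective(theta, x, edges, n, target, true_label):
    """O_atk: positive exactly when the target is closer to the wrong class."""
    a = aggregate(x, edges, n)[target]
    return (a - theta[true_label]) ** 2 - (a - theta[1 - true_label]) ** 2


def perturbation_set(edges, n, budget):
    """Phi(G): all graphs within `budget` edge flips of G (G included)."""
    base = frozenset(edges)
    pairs = list(combinations(range(n), 2))
    for k in range(budget + 1):
        for flips in combinations(pairs, k):
            yield base ^ frozenset(flips)


def attack(x, edges, n, labels, target, true_label, poisoning, budget=1):
    """Brute-force Eq. 8.1 (poisoning) or its evasion variant."""
    clean_theta = train(x, edges, n, labels)
    best_val, best_graph = None, None
    for g in perturbation_set(edges, n, budget):
        # Poisoning retrains on every candidate graph; evasion keeps theta fixed.
        theta = train(x, g, n, labels) if poisoning else clean_theta
        val = attack_objective(theta, x, g, n, target, true_label)
        if best_val is None or val > best_val:
            best_val, best_graph = val, g
    return best_val, best_graph


x = [0.0, 0.1, 0.2, 1.0, 0.9, 0.8]          # one scalar feature per node
edges = {(0, 1), (1, 2), (3, 4), (4, 5)}    # two path components
labels = {0: 0, 1: 0, 3: 1, 4: 1}           # labeled training nodes
clean = attack_objective(train(x, edges, 6, labels), x, edges, 6, 2, 0)
evasion_val, _ = attack(x, edges, 6, labels, 2, 0, poisoning=False)
poison_val, _ = attack(x, edges, 6, labels, 2, 0, poisoning=True)
```

The only difference between the two settings is the single line that chooses `theta`; in realistic GNNs that line hides a full training run, which is why poisoning attacks are much harder to optimize than evasion attacks.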
This general form of an attack enables us to provide a categorization along dif- 
ferent aspects and illustrates the space to explore for robustness characteristics of 
GNNs in general. While this taxonomy is general, for ease of understanding, it helps 
to think about an intentional attacker. 


Aspect 1: Property under Investigation (Attacker’s Goal) 


What is the robustness property we want to analyze? For example, do we want to
understand how robust the classification of an individual node is? Will it change
when perturbing the data? The property under investigation is modeled via $\mathcal{O}_{atk}$. It
intuitively represents the attacker’s goal. If $\mathcal{O}_{atk}$, for example, measures the difference
between a node’s ground truth label and the currently predicted one, maximizing this
difference in Eq. 8.1 tries to enforce a misclassification.

The attacker’s goal is highly task-dependent. The majority of existing works have
focused on the robustness of node-level classification based on GNNs, where we
have to distinguish two scenarios. Works such as (Zügner et al, 2018; Dai et al,
2018a; Wang and Gong, 2019; Wu et al, 2019b; Chen et al, 2020f; Wang et al,
2020c) investigate how the prediction of an individual target node changes under
perturbations, also called a local attack. In contrast, Zügner and Günnemann (2019);
Wu et al (2019b); Liu et al (2019c); Ma et al (2020b); Geisler et al (2021); Sun et al
(2020d) have investigated how the overall performance on an entire set of nodes
can drop, called a global attack.² This seemingly subtle difference between the
two scenarios is crucial: In the latter case one has to find a single perturbed graph
$\hat{\mathcal{G}} \in \Phi(\mathcal{G})$ which simultaneously changes many predictions, taking into account that
all node-level predictions are indeed done jointly based on one input. In the former
case, for each individual target node $v_i$ a different perturbation $\hat{\mathcal{G}}_i \in \Phi(\mathcal{G})$ can be
selected. Both views are reasonable; they simply model different aspects.

Beyond node-level classification, further works have investigated robustness of 
graph-level classification (Chen et al, 2020)), link prediction (Chen et al, 2020h; Lin 
et al, 2020d), and node embeddings (Bojchevski and Günnemann, 2019; Zhang et al, 
2019e). The last one is worth mentioning since it targets an unsupervised learning
setting, aiming to be task-agnostic. Unlike the other examples, the goal is not to
target one specific task but to degrade the quality of the embeddings in general such
that one or multiple downstream tasks are hindered. Since it is not known a priori
which tasks (classification, link prediction, etc.) will be performed based on the
node embeddings, defining the objective $\mathcal{O}_{atk}$ is challenging. As a proxy measure,
Bojchevski and Günnemann (2019) for example use the training loss itself, setting
$\mathcal{O}_{atk} = \mathcal{L}_{train}$.


² Local attacks have also been called targeted attacks, while global ones untargeted. Since this,
however, leads to a name clash with categorizations used in other communities (Carlini and Wagner,
2017) we decided to use local/global here.


Aspect 2: The Perturbation Space (Attacker’s Capabilities) 


What changes are allowed to the original graph? What do we expect the perturbations
to look like? For example, do we want to understand how deleting a few edges
influences the prediction? The space of perturbations under consideration is modeled
via $\Phi(\mathcal{G})$. It intuitively represents the attacker’s capabilities: what and how
much they are able to manipulate. The complexity of the perturbation space for
graphs represents one of the biggest differences from classical robustness studies and
stretches along two dimensions.

(1) What can be changed? Unique to the graph domain are perturbations of the
graph structure. In this regard, most publications have studied the scenarios of
adding edges to or removing edges from the graph (Dai et al, 2018a; Wang and Gong, 2019;
Zügner et al, 2018; Zügner and Günnemann, 2019; Bojchevski and Günnemann,
2019; Zhang et al, 2019e; Tang et al, 2020b; Chen et al, 2020f;
Chang et al, 2020b; Ma et al, 2020b; Geisler et al, 2021). Focusing on the node level,
some works (Wang et al, 2020c; Sun et al, 2020d; Geisler et al, 2021) have considered
adding or removing nodes from the graph. Beyond the graph structure, GNN
robustness has also been explored for changes to the node attributes (Zügner et al,
2018; Wu et al, 2019b; Takahashi, 2019) and the labels used in semi-supervised
node classification (Zhang et al, 2020b).

An intriguing aspect of graphs is how the interdependence of instances
affects robustness. Due to the message passing scheme, changes to
one node might affect (potentially many) other nodes. Often, for example, a node’s
prediction depends on its k-hop neighborhood, intuitively representing the node’s
receptive field. Thus, it is not only important what type of change can be performed
but also where in the graph this can happen. Consider for example Figure 8.1: to
analyze whether the prediction for the highlighted node can change, we are not limited
to perturbing the node’s own attributes and its incident edges but can also
achieve our aim by perturbing other nodes. Indeed, this reflects real-world scenarios
much better since it is likely that an attacker has access to a few nodes only, and
not to the entire data or the target node itself. Put simply, we also have to consider
which nodes can be perturbed. Multiple works (Zügner et al, 2018; Zhang et al,
2019e; Takahashi, 2019) investigate what they call indirect attacks (or sometimes
influencer attacks), specifically analyzing how an individual node’s prediction can
change when only perturbing other parts of the graph while leaving the target node
untouched.
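The notion of a receptive field can be made concrete with a small sketch. The function below collects all nodes within k hops of a target node by breadth-first search; the adjacency-list input and the name `k_hop_neighborhood` are illustrative assumptions, not part of any cited method. Note that it describes the receptive field of the clean graph only: structure perturbations such as edge insertions can themselves change which nodes fall inside it.

```python
from collections import deque


def k_hop_neighborhood(adj_list, target, k):
    """Return all nodes within k hops of `target` (including `target`)."""
    visited = {target: 0}          # node -> hop distance from target
    queue = deque([target])
    while queue:
        node = queue.popleft()
        if visited[node] == k:     # do not expand beyond k hops
            continue
        for neighbor in adj_list[node]:
            if neighbor not in visited:
                visited[neighbor] = visited[node] + 1
                queue.append(neighbor)
    return set(visited)


# Path graph 0 - 1 - 2 - 3 - 4 as an undirected adjacency list.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
receptive_field = k_hop_neighborhood(path, target=0, k=2)
# Candidate nodes for an indirect attack: everything but the target itself.
indirect_candidates = receptive_field - {0}
```

For a message passing model with k layers and a fixed structure, attribute changes at nodes outside this set cannot reach the target, which is why the location of a perturbation matters as much as its type.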

(2) How much can be changed? Typically, adversarial examples are designed to
be nearly indistinguishable from the original input, e.g., changing the pixel values of an
image so that it stays visually the same. Unlike for image data, where this can easily be
verified by manual inspection, this is much more challenging in the graph setting.
Technically, the set of perturbations can be defined based on any graph distance
function $D$ measuring the (dis)similarity between graphs. All graphs similar to the
given graph $\mathcal{G}$ then define the set $\Phi(\mathcal{G}) = \{\hat{\mathcal{G}} \in \mathbb{G} \mid D(\mathcal{G}, \hat{\mathcal{G}}) \le \Delta\}$,
where $\mathbb{G}$ denotes the space of all potential graphs and $\Delta$ the largest acceptable distance.

Defining suitable graph distance functions is in itself a challenging
task. Beyond that, computing these distances and using them within the optimization
problem of Eq. 8.1 might be computationally intractable (think, e.g.,
of the graph edit distance, which is itself NP-hard to compute). Therefore, existing
works have mainly focused on so-called budget constraints, limiting the number
of changes allowed to be performed. Technically, such budgets correspond to the
$L_0$ pseudo-norm between the clean and perturbed data, e.g., relating to the graph’s
adjacency matrix $A$ or its node attributes $X$.³ To enable more fine-grained control,
such budget constraints are often used locally per node (e.g., limiting the maximal
number of edge deletions per node; $\Delta_i^{loc}$) as well as globally (e.g., limiting the
overall number of edge deletions; $\Delta^{glob}$). For example,


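$$\Phi(\mathcal{G}) = \{\hat{\mathcal{G}} = (\hat{A}, \hat{X}) \in \mathbb{G} \mid \|A - \hat{A}\|_0 \le \Delta^{glob} \wedge \forall i: \|A_i - \hat{A}_i\|_0 \le \Delta_i^{loc} \wedge X = \hat{X}\}, \tag{8.2}$$

where the graphs $\mathcal{G} = (A, X)$ and $\hat{\mathcal{G}} = (\hat{A}, \hat{X})$ are assumed to have the same size and
the node attributes, $X$ resp. $\hat{X}$, to stay unchanged; $A_i$ denotes the $i$th row of $A$.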
Beyond these budget constraints, it might be useful to preserve further characteristics
of the data. In particular, for real-world networks many patterns such as specific
degree distributions, large clustering coefficients, low diameter, and more are
known to hold (Chakrabarti and Faloutsos, 2006). If two graphs show very different
patterns, it is easy to tell them apart, and a different prediction could be expected.
Therefore, (Zügner et al, 2018; Zügner and Günnemann, 2019; Lin et al, 2020d)
consider only perturbed graphs that follow similar power-law behavior in
the degree distribution. Similarly, one can impose constraints on the attributes,
considering, e.g., the co-occurrence of specific values.
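The budget constraints of Eq. 8.2 translate almost directly into a membership test. The following is a minimal sketch assuming dense NumPy adjacency and attribute matrices; the function name `within_budget` and its argument names are chosen here for illustration. The local budget may be a scalar or one value per node.

```python
import numpy as np


def within_budget(A, A_hat, X, X_hat, delta_glob, delta_loc):
    """Check the constraints of Eq. 8.2 for a perturbed graph (A_hat, X_hat)."""
    if A.shape != A_hat.shape or X.shape != X_hat.shape:
        return False                      # graphs must have the same size
    changed = A != A_hat                  # boolean mask of modified entries
    global_ok = changed.sum() <= delta_glob
    local_ok = np.all(changed.sum(axis=1) <= delta_loc)   # row-wise L0 norm
    attrs_ok = np.array_equal(X, X_hat)   # attributes must stay unchanged
    return bool(global_ok and local_ok and attrs_ok)


A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]])
X = np.eye(3)
A_hat = A.copy()
A_hat[0, 2] = A_hat[2, 0] = 1             # insert the undirected edge (0, 2)

allowed = within_budget(A, A_hat, X, X, delta_glob=2, delta_loc=1)
too_strict = within_budget(A, A_hat, X, X, delta_glob=1, delta_loc=1)
```

One design detail worth noting: for a symmetric adjacency matrix, flipping one undirected edge changes two entries, so the global count above is twice the number of edge flips. Whether budgets are counted per entry or per edge is a convention that has to be fixed when comparing results.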


Aspect 3: Available Information (Attacker’s Knowledge) 


What information is available to find a harmful perturbation? What is the attacker’s 
knowledge about the system? Considering a human-like adversary, the more knowl- 
edge is available, the stronger are the potential attacks. 

In general, we have to distinguish between knowledge about the data/graph and
knowledge about the model. For the first, either the full graph could be known or
only parts of it, as investigated, e.g., in (Zügner et al, 2018; Dai et al, 2018a; Chang
et al, 2020b; Ma et al, 2020b). While for worst-case analysis we often assume that
the attacker has full knowledge, for practical scenarios it is indeed realistic to assume
that an attacker only observes subsets of the data. For supervised learning settings,
the ground-truth labels of the target node(s) could additionally be hidden from the
attacker. The knowledge about the model includes many aspects such as knowledge
about the GNN architecture used, the model’s weights, or whether only the output
predictions or the gradients are known. Given all these variations, the most common
ones are white-box settings, where full information is available, and black-box settings,
which usually mean that only the graph and potentially the predicted outputs
are available.

3 This is a similar approach to image data, where we often take a certain radius as measured by,
e.g., an $L_p$ norm around the original input as the allowed perturbation set, assuming that for small
radii the semantic meaning of the input does not change.

Among the three aspects above, the attacker’s knowledge seems to be the one
which most strongly links to human-like adversaries. It should be highlighted,
though, that worst-case perturbations in general are best reflected by the fully
white-box setting, making it the preferred choice for strong robustness results. If a model
performs robustly in a white-box setting, it will also be robust under the more limited
scenarios. Moreover, as we will see in Section 8.2.2.1, the transferability of attacks
implies that knowledge about the model is not really required.


Aspect 4: The Algorithmic View 


Besides the above categorization that focuses on the properties of the attack, another,
more technical, view can be taken by considering the algorithmic approach
by which the (bi-level) optimization problem is solved. In the discussion of the
perturbation space we have seen that graph perturbations often relate to the addition/removal
of edges or nodes. These are discrete decisions, making Eq. 8.1 a
discrete optimization problem. This is in stark contrast to other data domains where
infinitesimal changes are possible. Thus, besides adapting gradient-based approximations,
various other techniques can be used to tackle Eq. 8.1 for GNNs,
such as reinforcement learning (Sun et al, 2020d; Dai et al, 2018a) or spectral
approximations (Bojchevski and Günnemann, 2019; Chang et al, 2020b). Moreover,
the attacker’s knowledge also has implications for the algorithmic choice. In a
black-box setting where, e.g., only the input and output are observed, we cannot use the
true GNN $f_\theta$ to compute gradients but have to use other principles like first learning
some surrogate model.
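To illustrate what a discrete search under a budget can look like, here is a deliberately simple sketch. It is not Nettack or any other cited method: it assumes a fixed, linearized two-layer GCN surrogate with given weights `W` (an evasion setting), and it scores every candidate edge flip at the target node by brute force instead of using gradients or incremental updates. All function names are hypothetical.

```python
import numpy as np


def normalize(A):
    """Symmetric GCN normalization with self-loops: D^-1/2 (A + I) D^-1/2."""
    A_tilde = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))
    return A_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]


def margin(A, X, W, target, label):
    """Logit of `label` minus the best other logit for node `target`."""
    A_norm = normalize(A)
    logits = (A_norm @ A_norm @ X @ W)[target]   # linearized 2-layer surrogate
    others = np.delete(logits, label)
    return logits[label] - others.max()


def greedy_attack(A, X, W, target, label, budget):
    """Greedily flip edges incident to `target`, at most one per step."""
    A_hat = A.copy()
    for _ in range(budget):
        best_margin, best_node = margin(A_hat, X, W, target, label), None
        for u in range(A.shape[0]):
            if u == target:
                continue
            candidate = A_hat.copy()
            # Flip the undirected edge (target, u): insert if absent, else remove.
            candidate[target, u] = candidate[u, target] = 1 - candidate[target, u]
            m = margin(candidate, X, W, target, label)
            if m < best_margin:
                best_margin, best_node = m, u
        if best_node is None:          # no single flip lowers the margin
            break
        A_hat[target, best_node] = A_hat[best_node, target] = (
            1 - A_hat[target, best_node]
        )
    return A_hat
```

The brute-force inner loop costs one forward pass per candidate, which is only feasible on tiny graphs; this is exactly the cost that gradient-based approximations and other scalable techniques try to avoid.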


8.2.2 The Effect of Perturbations and Some Insights 


The above categorization shows that various kinds of adversarial perturbations under 
different scenarios can be investigated. Summarizing the different results obtained 
in the literature so far, the trend is clear: standard GNNs trained in the standard way 
are not robust. In the following, we give an overview of some key insights. 

Fig. 8.2: Performing local structure attacks on a GCN model and the Cora ML data
with the Nettack (Zügner et al, 2018) approach. If a node is below the dashed line it is
misclassified w.r.t. the ground truth label. As shown, almost any node’s prediction can be
changed. [Plot: classification margin (vertical axis) for five columns: Original Graph;
Nettack and Nettack-In. with budget $\Delta = \lfloor d/2 \rfloor$; Nettack and Nettack-In. with
budget $\Delta = d$.]

Figure 8.2 illustrates one of the results of the method Nettack as introduced in
(Zügner et al, 2018). Here, local attacks in an evasion setting focusing on graph
structure perturbations are analyzed for a GCN (Kipf and Welling, 2017b). The
figure shows the classification margin, i.e., the predicted
probability of the node’s true class minus that of the second-highest class. The
left column shows the results for the unperturbed graph, where most nodes are correctly
classified, as illustrated by the predominantly positive classification margin.
The second column shows the result after perturbing the graph based on the perturbation
found by Nettack using a global budget of $\Delta = \lfloor d_v/2 \rfloor$ and making sure that
no singletons occur, where $d_v$ is the degree of the node $v$ under attack. Clearly, the
GCN model is not robust: almost every node’s prediction can be changed. Moreover,
the third column shows the impact of indirect attacks. Recall that in these scenarios
the performed perturbations cannot happen at the node we aim to misclassify. Even
in this setting, a large fraction of nodes is vulnerable. The last two columns show
results for an increased budget of $\Delta = d_v$. Not surprisingly, the impact of the attack
becomes even more pronounced.
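The classification margin used in Figure 8.2 can be computed as follows. This sketch assumes the margin is taken against the highest-scoring class other than the true one, so that the value is negative exactly when the node is misclassified, which matches the reading of the dashed line in the figure. The function name is illustrative.

```python
import numpy as np


def classification_margin(probs, true_class):
    """Probability of the true class minus the largest other class probability."""
    probs = np.asarray(probs, dtype=float)
    others = np.delete(probs, true_class)
    return probs[true_class] - others.max()


correct = classification_margin([0.7, 0.2, 0.1], true_class=0)   # positive margin
wrong = classification_margin([0.3, 0.6, 0.1], true_class=0)     # negative margin
```

A successful local attack is then one that pushes a node's margin from positive to negative while staying within the budget.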

Considering global attacks in the poisoning setting, similar behavior can be observed.
For example, when studying the effect of node additions, the work (Sun et al,
2020d) reports a relative drop in accuracy by up to 7 percentage points with a budget
of 1% of additional nodes, without changing the connectivity between existing
nodes. For changes to the edge structure, the work (Zügner and Günnemann, 2019)
reports performance drops on the test sets by around 6 to 16 percentage points when
perturbing 5% of the edges. Notably, on one dataset, these perturbations lead to
a GNN obtaining worse performance than a logistic regression baseline operating
only on the node attributes, i.e., ignoring the graph altogether becomes the better
choice.

The following observation from (Zügner and Günnemann, 2019) is important
to highlight: one core factor for the lower performance obtained on the perturbed
graphs is indeed the learned GNN weights. When using the weights $\theta_{\hat{\mathcal{G}}}$ trained on
the perturbed graph $\hat{\mathcal{G}}$ obtained by the poisoning attack, not only is the performance
on $\hat{\mathcal{G}}$ low, but even the performance on the unperturbed graph $\mathcal{G}$ suffers dramatically.
Likewise, when applying weights $\theta_{\mathcal{G}}$ trained on the unperturbed graph $\mathcal{G}$ to
the graph $\hat{\mathcal{G}}$, the classification accuracy barely changes. Thus, the poisoning attack
performed in (Zügner and Günnemann, 2019) indeed derails the training procedure,
i.e., leads to ‘bad’ weights. This result emphasizes the importance of the training
procedure for the performance of graph models. If we are able to find appropriate
weights, even perturbed data might be handled more robustly. We will encounter
this aspect again in Section 8.4.2.


8.2.2.1 Transferability and Patterns


An interesting question to investigate is the transferability of adversarial examples.
Transferability relates to the fact that a harmful perturbation for one model (e.g. a
GCN) is also harmful for another model (e.g. GAT (Veličković et al, 2018)). Thus,
one can simply reuse one perturbation to fool many models. The transferability of
GNN attacks has been investigated in multiple works (Zügner et al, 2018; Zügner
and Günnemann, 2019; Lin et al, 2020d; Chen et al, 2020f) and seems to hold across
many models. For example, local attacks computed based on Nettack’s GCN-like
surrogate model in an evasion scenario are also harmful for the original GCN and
the Column Network (Pham et al, 2017) model, for evasion and poisoning alike.
Interestingly, the performance deteriorates even for unsupervised node embeddings
such as DeepWalk (Perozzi et al, 2014), combined with a subsequent logistic
regression to obtain predictions.

The wide transferability of adversarial perturbations could be an indicator that
they follow general patterns. There seems to be some systematic change of the graph
which hinders many GNN models from performing well. If we can find out what makes,
for example, an edge insertion a strong adversarial change, we can use this knowledge
to detect adversarial attacks and/or make graph neural networks more robust
(see Section 8.4). However, it is still not fully understood what makes these
adversarial attacks harmful to a variety of models.

In (Zhang et al, 2019b) the predicted categorical distributions over classes for
perturbed and unperturbed instances after performing a Nettack attack have been
analyzed. Inspecting the average KL-divergence of the predicted categorical distributions
of a node and its neighbors, perturbed nodes seem to show higher divergences,
i.e., the attacks appear to be aiming to violate the homophily assumption
in the graph. Relatedly, Wu et al (2019b) compared the Jaccard similarity between
adjacent nodes’ attributes and noticed a change in distribution between the clean and
perturbed graph. The work (Zügner et al, 2020) investigated various graph properties,
including aspects such as node degree, closeness centrality, PageRank (Brin
and Page, 1998) scores, or attribute similarity. They focused on structure attacks
using Nettack, allowing only edge insertions and deletions to the target node.

Fig. 8.3: Cumulative distributions of properties of nodes connected to (Insert) or
disconnected from (Remove) the target node by the Nettack method. Baseline is the
distribution in the entire graph.

Figure 8.3 compares the distribution of such a property (e.g. node degree) when
considering all nodes of the unperturbed graph with the distribution of the property
when considering only the nodes incident to the inserted/removed adversarial
edges. The comparison indicates a statistically significant difference between the
distributions. For example, in Figure 8.3 (left) we can see that the Nettack method
tends to connect a target node to low-degree nodes. This could be due to the
degree normalization performed in GCN, where low-degree nodes have a higher weight
(i.e., influence) on the aggregation of a node. Likewise, considering nodes incident
to edges removed by the adversary, we can observe that the Nettack method tends
to disconnect high-degree nodes from the target node. In Figure 8.3 (second and
third plot) we can see that the attack tends to connect the target node with peripheral
nodes, as evidenced by the small two-hop neighborhood size and low closeness centrality
of the adversarially connected nodes. In Figure 8.3 (right) we can see that
the adversary tends to connect a target node to other nodes which have dissimilar
attributes. As also shown in other works, the adversary appears to try to counter the
homophily property in the graph, which is not surprising since the GNN has likely
learned to partly infer a node’s class based on its neighbors.
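The attribute-similarity statistic mentioned above can be sketched in a few lines. The code assumes binary node attributes and a symmetric adjacency matrix, and the function names are illustrative; it is meant to show the kind of quantity being compared, not the exact procedure of any cited work.

```python
import numpy as np


def jaccard(x, y):
    """Jaccard similarity of two binary attribute vectors."""
    x, y = np.asarray(x, dtype=bool), np.asarray(y, dtype=bool)
    union = np.logical_or(x, y).sum()
    if union == 0:
        return 0.0       # convention chosen here for two all-zero vectors
    return np.logical_and(x, y).sum() / union


def edge_similarities(A, X):
    """Jaccard similarity of the endpoints' attributes for every undirected edge."""
    rows, cols = np.nonzero(np.triu(A, k=1))   # each undirected edge once
    return np.array([jaccard(X[u], X[v]) for u, v in zip(rows, cols)])


X = np.array([[1, 1, 0],
              [1, 1, 0],
              [0, 0, 1]])
A_clean = np.array([[0, 1, 0], [1, 0, 0], [0, 0, 0]])
A_pert = np.array([[0, 1, 1], [1, 0, 0], [1, 0, 0]])   # adversarial edge (0, 2)

clean_sims = edge_similarities(A_clean, X)
pert_sims = edge_similarities(A_pert, X)
```

Comparing the distribution of `clean_sims` with that of `pert_sims` is one simple way to make a shift toward dissimilar neighbors visible.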

To understand whether such detected patterns are universal, they can themselves be used
to design attack principles. Indeed, this even leads to black-box attacks
since the analyzed properties usually relate to the graph only and not the GNN. In
(Zügner et al, 2020) a prediction model was learned that estimates the potential impact
of a perturbation on unseen graphs using the above-mentioned properties as input
features. While this often resulted in finding effective adversarial perturbations, thus
highlighting the generality of the regularities uncovered, the attack performance
was not on par with the original Nettack attack. Similarly, in (Ma et al, 2020b)
PageRank-like scores have been used to identify potentially harmful perturbations.


8.2.3 Discussion and Future Directions 


The aspects along which adversarial attacks on graphs can be studied allow for a
huge variety of scenarios. Only a few of them have been thoroughly investigated
in the literature. One important aspect to consider, for example, is that in real
applications the cost of perturbations differs: while changing node attributes might be
relatively easy, injecting edges might be harder. Thus, designing improved perturbation
spaces can make the attack scenarios more realistic and better capture the
robustness properties one might want to ensure. Moreover, many different data domains
such as knowledge graphs or temporal graphs need to be investigated.

Importantly, while first steps have been made to understand the patterns that
make these perturbations harmful, a clear understanding with a sound theoretical
backing is still missing. In this regard, it is also worth repeating that all these studies
have focused on analyzing perturbations obtained by Nettack; other attacks might
potentially lead to very different patterns. This also implies that exploiting the resulting
patterns to design more robust GNNs (see Section 8.4.1) is not necessarily
a good solution. Moreover, finding reliable patterns also requires more research on
how to compute adversarial perturbations in a scalable way (Wang and Gong, 2019;
Geisler et al, 2021), since such patterns might be more pronounced on larger graphs.


8.3 Provable Robustness: Certificates for Graph Neural Networks


Adversarial attack approaches are heuristics to highlight potential vulnerabilities of
a GNN. However, they do not provide formal guarantees on the reliability of the
methods. In particular, an unsuccessful attack does not imply the robustness of the
GNN. It might just be that the attack approach could not find an adversarial
example since it does not solve Eq. 8.1 exactly. Attacks, when successful,
only provide results about non-robustness. For a safe use of GNNs, however, we
need the opposite: we need principles for provable robustness. These methods provide
so-called robustness certificates, giving formal guarantees that no perturbation
within a specific perturbation model $\Phi(\mathcal{G})$ will change the prediction.

Considering, for example, the task of node-level classification, the problem these
certification approaches aim to solve is: given a graph $\mathcal{G}$, a perturbation set
$\Phi(\mathcal{G})$, and a GNN $f_\theta$, verify that the predicted class for node $v$ stays the same for
all $\hat{\mathcal{G}} \in \Phi(\mathcal{G})$. If this holds, we say that $v$ is certifiably robust w.r.t. $\Phi(\mathcal{G})$.
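Written as a formula, the certification problem above reads as follows. The per-node indexing $f_\theta(\cdot)_{v,c}$ (the logit of class $c$ for node $v$) and the symbol $\mathcal{C}$ for the set of classes are notational assumptions made here for illustration.

```latex
v \text{ is certifiably robust w.r.t. } \Phi(\mathcal{G})
\;\Longleftrightarrow\;
\forall\, \hat{\mathcal{G}} \in \Phi(\mathcal{G}):\;
\arg\max_{c \in \mathcal{C}} f_\theta(\hat{\mathcal{G}})_{v,c}
= \arg\max_{c \in \mathcal{C}} f_\theta(\mathcal{G})_{v,c}
```

The universal quantifier is what separates a certificate from an attack: an attack exhibits one perturbed graph, whereas a certificate must cover all of them.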

Only a few robustness certificates have been proposed for GNNs so far. They can
mainly be categorized into two principles: model-specific and model-agnostic.


8.3.1 Model-Specific Certificates 


Model-specific certificates are designed for a specific class of GNN models (e.g.,
2-layer GCNs) and a specific task such as node-level classification. A common theme
is to phrase certification as a constrained optimization problem. Recall that in a
classification task, the final prediction is usually obtained by taking the class with
the largest predicted probability or logit. Let $c^* = \arg\max_{c \in \mathcal{C}} f_\theta(\mathcal{G})_c$ denote the
predicted class⁴ obtained on the unperturbed graph $\mathcal{G}$, where $\mathcal{C}$ is the set of classes
and $f_\theta(\mathcal{G})_c$ denotes the logit obtained for class $c$. This specifically implies that the
margin $f_\theta(\mathcal{G})_{c^*} - f_\theta(\mathcal{G})_c$ between class $c^*$ and any other class $c$ is positive.

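A particularly useful quantity for robustness certification is the worst-case margin,
i.e., the smallest margin possible under any perturbed data $\hat{\mathcal{G}}$:

$$\hat{m}(c^*, c) = \min_{\hat{\mathcal{G}} \in \Phi(\mathcal{G})} \left[ f_\theta(\hat{\mathcal{G}})_{c^*} - f_\theta(\hat{\mathcal{G}})_c \right] \tag{8.3}$$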


4 This could either be the predicted class for a specific target node $v$ in case of node-level
classification, or for the entire graph in case of graph-level classification. We drop the dependency on $v$
since it is not relevant for the discussion. For simplicity, we assume the maximizer $c^*$ to be unique.

Fig. 8.4: Obtaining robustness certificates via the worst-case margin: The prediction
obtained from the unperturbed graph $\mathcal{G}_i$ is illustrated with a cross, while the
predictions for the perturbed graphs $\Phi(\mathcal{G}_i)$ are illustrated around it. The worst-case
margin measures the shortest distance to the decision boundary. If it is positive (see
$\mathcal{G}_1$), all predictions are on the same side of the boundary; robustness holds. If it is
negative (see $\mathcal{G}_2$), some predictions cross the decision boundary; the class prediction
will change under perturbations, meaning the model is not robust. When using lower
bounds (the shaded regions in the figure), robustness is ensured for positive values
(see $\mathcal{G}_1$) since the exact worst-case margin can only be larger. If the lower bound
becomes negative, no statement can be made (see $\mathcal{G}_2$ and $\mathcal{G}_3$; robustness unknown).
Both $\mathcal{G}_2$ and $\mathcal{G}_3$ have a negative lower bound, while the (not tractable to compute)
exact worst-case margin differs in sign.


If this term is positive, $c$ can never be the predicted class for node $v$. And if the
worst-case margin $\hat{m}(c^*, c)$ stays positive for all $c \neq c^*$, the prediction is certifiably
robust since the logit for class $c^*$ is always the largest for all perturbed graphs in
$\Phi(\mathcal{G})$. This idea is illustrated in Figure 8.4.

As shown, obtaining a certificate means solving the (constrained) optimization
problem in Eq. 8.3 for every class $c$. Not surprisingly, however, solving
this optimization problem is usually intractable, for reasons similar to those that make
computing adversarial attacks hard. So how can we obtain certificates? Just heuristically
solving Eq. 8.3 is not helpful since we aim for guarantees.


Lower Bounds on the Worst-Case Margin 


The core idea is to obtain tractable lower bounds on the worst-case margin. That is,
we aim to find functions $\hat{m}_{LB}$ that ensure $\hat{m}_{LB}(c^*, c) \le \hat{m}(c^*, c)$ and are more
efficient to compute. One solution is to consider relaxations of the original constrained
minimization problem, replacing, for example, the model’s nonlinearities and hard
discreteness constraints with their convex relaxations. For example, instead of requiring
that an edge is either perturbed or not, indicated by the variables $e \in \{0, 1\}$, we can
use $e \in [0, 1]$. Intuitively, using such relaxations leads to supersets of the actually
reachable predictions, as visualized in Figure 8.4 with the shaded regions.

Overall, if the lower bound $\hat{m}_{LB}$ stays positive, the robustness certificate holds,
since $\hat{m}$ is then positive as well. This is shown in Figure 8.4 for graph $\mathcal{G}_1$.
If $\hat{m}_{LB}$ is negative, no statement can be made since it is only a lower bound of the
original worst-case margin $\hat{m}$, which thus can be positive or negative. Compare the
two graphs $\mathcal{G}_2$ and $\mathcal{G}_3$ in Figure 8.4: while both have a negative lower bound (i.e.,
both shaded regions cross the decision boundary), their actual worst-case margins $\hat{m}$
differ. Only for graph $\mathcal{G}_2$ do the actually reachable predictions (which are not efficiently
computable) also cross the decision boundary. Thus, if the lower bound is negative,
the actual robustness remains unknown, similar to an unsuccessful attack, where
it remains unclear whether the model is actually robust or the attack is simply
not strong enough. Therefore, besides being efficient to compute, the function $\hat{m}_{LB}$
should be as close as possible to $\hat{m}$ to avoid cases where no answer can be given
despite the model being robust.
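The asymmetry between positive and negative lower bounds can be captured in a tiny decision rule. The sketch below assumes the lower bounds have already been computed by some certification method and are passed in as a dictionary; the function name and return values are illustrative.

```python
def certify(lower_bounds):
    """Decide robustness from lower bounds on the worst-case margin.

    `lower_bounds` maps every class c != c* to a lower bound on the
    worst-case margin between c* and c. Returns "robust" if all bounds are
    positive and "unknown" otherwise; a negative bound alone never proves
    non-robustness.
    """
    if all(bound > 0 for bound in lower_bounds.values()):
        return "robust"
    return "unknown"


certified = certify({1: 0.3, 2: 0.1})     # every bound positive
undecided = certify({1: 0.3, 2: -0.2})    # one bound negative: no statement
```

There is deliberately no "not robust" outcome: establishing non-robustness requires an actual adversarial example, i.e., a successful attack.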

The above idea, using convex relaxations of the model’s nonlinearities and the
admissible perturbations, is used in the works (Zügner and Günnemann, 2019;
Zügner and Günnemann, 2020) for the class of GCNs and node-level classification.
In (Zügner and Günnemann, 2019), the authors consider perturbations to the node
attributes and obtain lower bounds via a relaxation to a linear program. The work
(Zügner and Günnemann, 2020) considers perturbations in the form of edge deletions
and reduces the problem to a jointly constrained bilinear program. Similarly,
also using convex relaxations, Jin et al (2020a) have proposed certificates for
graph-level classification under edge perturbations using GCNs. Beyond GCNs,
model-specific certificates for edge perturbations have also been devised for the class of
GNNs using PageRank diffusion (Bojchevski and Günnemann, 2019), which includes
label/feature propagation and (A)PPNP (Klicpera et al, 2019a). The core idea
of (Bojchevski and Günnemann, 2019) is to treat the problem as a PageRank optimization
task which subsequently can be expressed as a Markov decision process.
Using this connection one can indeed show that in scenarios where only local budgets
are used (see Section 8.2; Eq. 8.2) the derived certificates are exact,
i.e., not merely a lower bound, while we can still compute them in polynomial time w.r.t. the
graph size. In general, all models above consider local and global budget constraints
on the number of changes.

Besides providing certificates, being able to efficiently compute (a differentiable
lower bound on) the worst-case margin as in Eq. 8.3 also makes it possible to improve
GNN robustness by incorporating the margin during training, i.e. aiming to
make it positive for all nodes. We will discuss this in detail in Section 8.4.2.

Overall, a strong advantage of model-specific certificates is their explicit consid- 
eration of the GNN model structure within the margin computation. However, the 
white-box nature of these certificates is simultaneously their limitation: The pro- 
posed certificates capture only a subset of the existing GNN models and any GNN 
yet to be developed likely requires a new certification technique as well. This limitation
is tackled by model-agnostic certificates.


8.3.2 Model-Agnostic Certificates


Model-agnostic certificates treat the machine learning model as a black box. For 
example, the work (Bojchevski et al, 2020a) provides certificates for any classifier 
operating on discrete data, including GNNs. Most importantly, it is sufficient to con- 
sider only the output of the classifier for different samples to obtain the certificate. 
This is precisely what makes it particularly appealing for certifying GNNs since it 
allows us to sidestep a complex analysis of the message-passing dynamics and the 
non-linear interactions between the nodes. So far, model-agnostic certificates are 
mainly based on the idea of randomized smoothing (Lecuyer et al, 2019; Cohen 
et al, 2019), originally proposed for continuous data. To handle graphs, extensions 
to discrete data have been proposed. 

The core idea is to base the certificate on a smoothed classifier, which aggregates
the output of the original (base) GNN when applied to randomly perturbed versions
of the input graph $\mathcal{G}$. For example, the smoothed classifier might report the most
likely (majority) class on these randomized samples. While different variants of this
approach are possible, we provide one intuitive setting in the following to convey
the main idea.

Let $f : \mathbb{G} \to \mathcal{C}$ denote a function (e.g., a GNN) that takes a graph $\mathcal{G} \in \mathbb{G}$ as input
and predicts a single class $f(\mathcal{G}) = c \in \mathcal{C}$ as output, e.g. a node’s prediction. Let $\tau$
be a smoothing distribution, also called randomization scheme, that adds random
noise to the input graph. For example, $\tau$ might randomly add Bernoulli noise to
the adjacency matrix of $\mathcal{G}$, corresponding to randomly adding or deleting edges.
Technically, $\tau$ assigns probability mass/density $\Pr(\tau(\mathcal{G}) = \mathcal{Z})$ to each graph
$\mathcal{Z} \in \mathbb{G}$. We can construct a smoothed (ensemble) classifier $g$ from the base classifier $f$
as follows:

$$g(\mathcal{G}) = \arg\max_{c \in \mathcal{C}} \Pr(f(\tau(\mathcal{G})) = c) \tag{8.4}$$

In other words, $g(\mathcal{G})$ returns the most likely class obtained by first randomly perturbing
the graph $\mathcal{G}$ using $\tau$ and then classifying the resulting graphs $\tau(\mathcal{G})$ with the
base classifier $f$.
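A Monte Carlo version of the smoothed classifier in Eq. 8.4 can be sketched as follows. The assumptions are stated plainly: the randomization scheme flips every undirected node pair independently with a single probability `flip_prob`, the base classifier `f` is any callable mapping an adjacency matrix to a class index, and the probabilities are estimated by sampling rather than computed exactly. The function names are illustrative.

```python
import numpy as np


def sample_perturbed(A, flip_prob, rng):
    """Flip each undirected node pair independently with probability `flip_prob`."""
    n = A.shape[0]
    flips = np.triu(rng.random((n, n)) < flip_prob, k=1)
    flips = flips | flips.T                       # keep the graph undirected
    return np.where(flips, 1 - A, A)


def smoothed_predict(f, A, flip_prob, n_samples, n_classes, rng):
    """Monte Carlo estimate of g(G) = argmax_c Pr(f(tau(G)) = c)."""
    counts = np.zeros(n_classes, dtype=int)
    for _ in range(n_samples):
        counts[f(sample_perturbed(A, flip_prob, rng))] += 1
    return int(counts.argmax()), counts / n_samples


# Toy base classifier (hypothetical): class 1 if node 0 has at least two neighbors.
def toy_classifier(adj):
    return int(adj[0].sum() >= 2)


rng = np.random.default_rng(0)
A = np.array([[0, 1, 1],
              [1, 0, 0],
              [1, 0, 0]])
prediction, frequencies = smoothed_predict(toy_classifier, A, 0.1, 200, 2, rng)
```

Other randomization schemes, for example with different probabilities for adding and for deleting edges, fit the same template by changing only `sample_perturbed`.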

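As in Section 8.3.1, the goal is to assess whether the prediction stays unchanged
under perturbations: denoting with $c^* = g(\mathcal{G})$ the class predicted by the smoothed
classifier on $\mathcal{G}$, we want $g(\hat{\mathcal{G}}) = c^*$ for all $\hat{\mathcal{G}} \in \Phi(\mathcal{G})$. Considering for simplicity the
case of binary classification, this is equivalent to ensuring that $\Pr(f(\tau(\hat{\mathcal{G}})) = c^*) > 0.5$
for all $\hat{\mathcal{G}} \in \Phi(\mathcal{G})$; or in short: $\min_{\hat{\mathcal{G}} \in \Phi(\mathcal{G})} \Pr(f(\tau(\hat{\mathcal{G}})) = c^*) > 0.5$.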

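Since, unsurprisingly, this term is intractable to compute, we resort again to a lower
bound to obtain the certificate:

$$\min_{\hat{\mathcal{G}} \in \Phi(\mathcal{G})} \min_{h \in \mathcal{H}_f} \Pr(h(\tau(\hat{\mathcal{G}})) = c^*) \le \min_{\hat{\mathcal{G}} \in \Phi(\mathcal{G})} \Pr(f(\tau(\hat{\mathcal{G}})) = c^*) \tag{8.5}$$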


Here, $\mathcal{H}_f$ is the set of all classifiers sharing some properties with $f$, e.g., often
that the smoothed classifier based on $h$ and $f$ would return the same probability for
$\mathcal{G}$, i.e., $\Pr(h(\tau(\mathcal{G})) = c^*) = \Pr(f(\tau(\mathcal{G})) = c^*)$. Since $f \in \mathcal{H}_f$, the inequality holds
trivially. Accordingly, if the left-hand side of Eq. 8.5 is larger than 0.5,
the right-hand side is guaranteed to be so as well, implying that $\mathcal{G}$ would be certifiably
robust.
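For the binary case, the argument can be condensed into a single chain that uses only the quantities of Eq. 8.5 (writing $G$, $\hat{G}$, and $H_f$ for the graph, its perturbed version, and the classifier set):

```latex
0.5 \;<\; \min_{\hat{G} \in \Phi(G)} \min_{h \in H_f} \Pr\big(h(\tau(\hat{G})) = c^*\big)
\;\le\; \min_{\hat{G} \in \Phi(G)} \Pr\big(f(\tau(\hat{G})) = c^*\big)
\;\Longrightarrow\;
g(\hat{G}) = c^* \quad \forall\, \hat{G} \in \Phi(G)
```

The second inequality needs nothing beyond $f \in H_f$: a minimum over a set can never exceed the value attained by one of its members.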

What does Eq. 8.5 intuitively mean? It aims to find a base classifier $h$
which minimizes the probability that the perturbed sample $\hat{\mathcal{G}}$ is assigned to class $c^*$.
Thus, $h$ represents a kind of worst-case base classifier which, when used within the
smoothed classifier, tries to obtain a different prediction for $\hat{\mathcal{G}}$. If even this worst-case
base classifier leads to certifiable robustness (left-hand side of Eq. 8.5
larger than 0.5), then surely the actual base classifier at hand does as well.

The most important part to make this all useful, however, is the following: given
a set of classifiers $\mathcal{H}_f$, finding the worst-case classifier $h$ and minimizing over the
perturbation model $\Phi(\mathcal{G})$ is often tractable. In some cases, the optima can even
be calculated in closed form. This shows an interesting relation to the previous
section: there, the intractable minimization over $\Phi(\mathcal{G})$ in Eq. 8.3 was replaced
by some tractable lower bound, e.g., via relaxations. Now, by finding a worst-case
classifier $h$ we not only obtain a lower bound, but the minimization over $\Phi(\mathcal{G})$
often becomes immediately tractable as well. Note, however, that in Section 8.3.1 we
obtain a certificate for the base classifier $f$, while here we obtain a certificate for the
smoothed classifier $g$.


Putting Model-Agnostic Certificates into Practice 


As said, given a set of classifiers $\mathcal{H}_f$, finding the worst-case classifier $h$ and
minimizing over the perturbation model $\mathcal{B}(\mathcal{G})$ is often tractable. The main
computational challenge in practice lies in determining $\mathcal{H}_f$. Let’s consider our
previous example where we required all classifiers $h$ to satisfy
$\Pr(h(t(\mathcal{G})) = c^*) = \Pr(f(t(\mathcal{G})) = c^*)$. To determine $\mathcal{H}_f$, one
needs to compute $\Pr(f(t(\mathcal{G})) = c^*)$.
Clearly, doing this exactly is again usually intractable. Instead, the probability can
be estimated using sampling. To ensure a tight approximation, the base classifier has
to be fed a large number of samples from the smoothing distribution. This becomes
increasingly expensive as the size and complexity of the GNN model increase.
Furthermore, the resulting estimates only hold with a certain probability. Accordingly,
the derived guarantees also hold only with that probability, i.e., one obtains only
probabilistic robustness certificates. Despite these practical limitations, randomized
smoothing has become widely popular, as it is often still more efficient than
model-specific certificates.
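
The sampling step can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy `base_classifier`, the independent edge-flip scheme in `smooth_sample`, and the use of a one-sided Hoeffding bound for the confidence interval (the cited works may use tighter bounds). It only shows how a sampled estimate turns into a lower bound that holds with probability at least 1 - alpha.

```python
import math
import random


def base_classifier(edges):
    """Toy stand-in for the base classifier f: class 1 if the graph has >= 3 edges."""
    return 1 if len(edges) >= 3 else 0


def smooth_sample(edges, candidate_edges, flip_prob, rng):
    """Toy smoothing scheme: flip each candidate edge independently with flip_prob."""
    sampled = set(edges)
    for e in candidate_edges:
        if rng.random() < flip_prob:
            sampled ^= {e}  # symmetric difference toggles the edge
    return sampled


def estimate_class_probability(edges, candidate_edges, target_class,
                               n_samples=10_000, flip_prob=0.1, alpha=0.05, seed=0):
    """Monte Carlo estimate of Pr(f(t(G)) = c*) plus a probabilistic lower bound."""
    rng = random.Random(seed)
    hits = sum(
        base_classifier(smooth_sample(edges, candidate_edges, flip_prob, rng)) == target_class
        for _ in range(n_samples)
    )
    p_hat = hits / n_samples
    # One-sided Hoeffding bound: the true probability is at least p_hat - eps
    # with probability >= 1 - alpha over the drawn samples.
    eps = math.sqrt(math.log(1.0 / alpha) / (2.0 * n_samples))
    return p_hat, max(0.0, p_hat - eps)


candidate_edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
graph = {(0, 1), (1, 2), (2, 3), (0, 3)}
p_hat, p_lower = estimate_class_probability(graph, candidate_edges, target_class=1)
```

The gap between `p_hat` and `p_lower` shrinks only with the square root of the number of samples, which is one way to see why tight probabilistic certificates require many forward passes through the base classifier.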

This general idea of model-agnostic certificates has been investigated for discrete 
data in (Lee et al, 2019a; Dvijotham et al, 2020; Bojchevski et al, 2020a; Jia et al, 
2020), with the latter two focusing also on graph-related tasks. In (Jia et al, 2020), 
the authors investigate the robustness of community detection. In (Bojchevski et al, 
2020a), the main focus is on node-level and graph-level classification w.r.t. graph 
structure and/or attribute perturbations under global budget constraints. Specifically, 
Bojchevski et al (2020a) overcomes critical limitations of the other approaches in
two regards: it explicitly accounts for sparsity in the data as present in many graphs,
and it obtains strong certificates with a dramatically reduced computational complexity.
Both aspects are core to making certification useful and possible for graph
data. Since the approach of (Bojchevski et al, 2020a) is agnostic to the underly- 
ing classifier — it can be used as long as the input is discrete — it has been applied 
to various GNNs including GCN, GAT, (A)PPNP (Klicpera et al, 2019a), RGCN 
(Zhu et al, 2019a), and Soft Medoid (Geisler et al, 2020) as well as node-level and 
graph-level classification. 


8.3.3 Advanced Certification and Discussion 


Research on robustness certificates for GNNs is still in a very early stage. As we 
have seen in Section 8.2, the space of attacks is vast with different properties to 
study and perturbation models to consider. The methods discussed above cover only 
a few of these scenarios. 

One step towards more powerful certificates is the work of (Schuchardt et al,
2021). As with local attacks on individual nodes, existing robustness certificates aim
to certify each prediction independently. Thus, they assume that an adversary can
use different perturbed inputs to attack different predictions. Alternatively, and
similar to a global attack, the work (Schuchardt et al, 2021) introduces collective
robustness certificates which compute the number of predictions which are simultaneously
guaranteed to remain stable under perturbation. That is, it exploits the fact that
a GNN simultaneously outputs multiple predictions based on a single shared input.
Given a fixed perturbation budget, using this idea, the number of certifiable predictions
can be increased by orders of magnitude compared to certifying each prediction
independently. The work, however, cannot handle perturbation models with
edge additions. As mentioned before, both views (local and global) are reasonable,
and it depends on the application which robustness guarantee is more relevant.

To cover the full spectrum of GNN applications, further certificates for
other scenarios and tasks are surely required. Specifically, so far, all certificates assume an
evasion attack scenario. It is also worth repeating that in the randomized smoothing
approaches discussed above, we are actually certifying the smoothed (ensemble)
classifier, and not the underlying base classifier. From a practitioner’s point of view,
this means that obtaining a single prediction always requires feeding a large number
of samples through the GNN, leading to a scalability bottleneck which needs to be
tackled in the future.


8.4 Improving Robustness of Graph Neural Networks 


As we have established, standard GNNs trained in the usual way are not robust to
even small changes to the graph; thus, using them in sensitive and critical applications
might be risky. Certificates can provide us guarantees about their performance.
However, as a consequence of the non-robustness, the certificates rarely hold for
standard models, i.e., only few predictions can be certified. To tackle this limitation,
methods aiming to improve robustness have been investigated, i.e., making the
models less susceptible to perturbations.⁵ In this regard, three broad, not mutually
exclusive, categories can be identified.


8.4.1 Improving the Graph 


One seemingly clear direction to improve robustness is to remove perturbations from 
the data, i.e., to revert the performed malicious changes and obtain a more ‘clean’ 
graph. While this may sound simple, the inherent challenge is that adversarial per- 
turbations are usually designed to be imperceptible, which makes their identification 
difficult. Still, as seen in Section 8.2.2.1, some patterns might be present. 

Works such as (Zhang et al, 2019b) exploit this idea to perform a ‘cleaning’ of
the graph before it is used as input to the GNN, relying on observations that, for
example, the predicted class distribution changes for attacked nodes. Similarly, for
attributed graphs, Wu et al (2019b) removes potential adversarial edges based on
the Jaccard similarity between the nodes’ attributes. Such pre-processing steps are
not limited to ‘attack detection’ approaches that try to spot individual suspicious
nodes or edges; they can also be thought of as a kind of denoising. Indeed, the
work (Entezari et al, 2020) found that perturbations performed by Nettack affect
mainly the high-rank (low-valued) singular components of the graph’s adjacency
matrix. Thus, to improve robustness, they compute a low-rank approximation of the
graph in a pre-processing procedure, which aims to remove the (adversarial) noise.
The limitation is that the resulting graph becomes dense. Overall, such graph cleaning
can be used in poisoning as well as evasion scenarios. Note, though, that an
approach that has been shown to perform well in one scenario is not guaranteed to
succeed in another.
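
To make the attribute-based edge filtering concrete, the following is a minimal sketch, not the implementation of the cited work. The function names, the set-based attribute representation, and the threshold value of 0.1 are assumptions chosen for illustration.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets of (binary) attributes."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def drop_dissimilar_edges(edges, attributes, threshold=0.1):
    """Keep only edges whose endpoints have Jaccard similarity >= threshold."""
    return [(u, v) for u, v in edges
            if jaccard(attributes[u], attributes[v]) >= threshold]


# Hypothetical toy data: node 2 shares no attributes with nodes 0 and 1.
attributes = {
    0: {"gnn", "graph", "robust"},
    1: {"graph", "robust", "attack"},
    2: {"protein", "folding"},
}
edges = [(0, 1), (0, 2), (1, 2)]
cleaned = drop_dissimilar_edges(edges, attributes)
```

In this toy graph only the edge between the two similar nodes survives. The filter can only remove edges, so it addresses adversarially inserted edges but not deletions.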

More generally, while these approaches have been shown to be effective in specific
scenarios, one has to be aware of one crucial limitation: the exploited patterns are 
often based on specific attacks like Nettack. Thus, the resulting detections might be 
limited to certain perturbations and potentially do not generalize to other scenarios. 

Improving the graph is not restricted to happen before the training or the infer- 
ence step, i.e. we do not need to follow a sequential approach of first cleaning and 
then learning a prediction model. Instead, the cleaning can be interwoven with the 
learning approach itself. Intuitively speaking, in order to minimize the correspond- 
ing training loss, one jointly learns the GNN parameters and also how to clean the 
graph itself. The benefit of this joint learning approach is that the specific model and 
task at hand can be taken into account, while the conditions enforced on the clean 
graph can be rather weak, e.g., only requiring that perturbations should be sparse.


⁵ In some works, such approaches are called (heuristic) defenses to highlight their increased
resilience to attacks. Similarly, some works use the term provable defense when referring to
certificates, since they provably prevent attacks within a certified set $\mathcal{B}(\mathcal{G})$
from being harmful.

Interestingly, even before the rise of graph neural networks, such joint approaches
have been investigated, e.g., in (Bojchevski et al, 2017) to improve the robustness of
spectral embeddings. For GNNs, such graph structure learning has been proposed in
(Jin et al, 2020e; Luo et al, 2021) where certain properties like low-rank graph structure
and attribute similarity are used to define what the clean graph should preferably
look like.


8.4.2 Improving the Training Procedure 


As discussed in Section 8.2.2, one further reason for the non-robustness of GNNs
is the parameters/weights learned during training. Weights resulting from standard
training often lead to models that do not generalize well to slightly perturbed data.
This is illustrated in Figure 8.5 with the orange/solid decision boundary. Note that
the figure shows the input space, i.e., the space of all graphs G; this is in contrast to
Figure 8.4 which shows the predicted probabilities. If we were able to improve our
training procedure to find ‘better’ parameters, taking into account that the data is or
might become perturbed, the robustness of our model would improve
as well. This is illustrated in Figure 8.5 with the blue/dashed decision boundary.
There, all perturbed graphs from $\mathcal{B}_r(\mathcal{G})$ get the same prediction. As seen,
in this regard robustness links to the generalization performance of prediction models in
general.


8.4.2.1 Robust Training 


Robust training refers to training procedures that aim at producing models that are 
robust to adversarial (and/or other) perturbations. The common theme is to optimize 
a worst-case loss (also called robust loss), i.e. the loss achieved under the worst-case 
perturbation. Technically, the training objective becomes:

$$\theta^* = \arg\min_{\theta} \max_{\tilde{\mathcal{G}} \in \mathcal{B}(\mathcal{G})} \mathcal{L}_{\text{train}}(f_\theta(\tilde{\mathcal{G}})) \qquad (8.6)$$

where $f_\theta$ is the GNN with its trainable weights $\theta$. As shown, we do not evaluate the
loss at the unperturbed graph but instead use the loss achieved in the worst case
(compare this to the standard training where we simply minimize
$\mathcal{L}_{\text{train}}(f_\theta(\mathcal{G}))$).
The weights are steered to obtain low loss under these worst scenarios as well, thus
obtaining better generalization.

Not surprisingly, solving Eq. 8.6 is usually not tractable for the same
reasons that finding attacks and certificates is hard: we have to solve a discrete, highly
complex (min-max) optimization problem. In particular, for training, e.g., via
gradient-based approaches, we also need to compute the gradient w.r.t. the inner
maximization. Thus, for feasibility, one usually has to resort to various surrogate
objectives, substituting the worst-case loss and the resulting gradient by simpler ones.


Data Augmentation during Training 


In this regard, the most naive approach is to randomly draw samples from the
perturbation set $\mathcal{B}(\mathcal{G})$ during each training iteration. That is, during
training the loss and the gradient are computed w.r.t. these randomly perturbed samples, with
different samples drawn in each training iteration. If the perturbation set, for example,
contains graphs where up to x edge deletions are admissible, we would randomly create
graphs with up to x edges dropped out. Such edge dropout has been analyzed in
various works but does not improve adversarial robustness substantially (Dai et al,
2018a; Zügner and Günnemann, 2020); a possible explanation is that the random
samples simply do not represent the worst-case perturbations well.
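
The gap between random sampling and the true inner maximum can be seen on a graph small enough to enumerate the whole perturbation set. The sketch below is purely illustrative: the one-step mean-aggregation model, its fixed weight, the logistic loss, and the toy graph are all assumptions, and brute-force enumeration is only feasible because the graph has six edges.

```python
import itertools
import math
import random


def loss(edges, features, target=0, label=1, weight=2.0):
    """Logistic loss of a toy one-step mean-aggregation model at the target node."""
    neighbors = {target}
    for u, v in edges:
        if u == target:
            neighbors.add(v)
        elif v == target:
            neighbors.add(u)
    score = weight * sum(features[n] for n in neighbors) / len(neighbors)
    return math.log1p(math.exp(-label * score))


def perturbation_set(edges, budget):
    """All graphs obtained by deleting at most `budget` edges (clean graph included)."""
    edges = sorted(edges)
    for k in range(budget + 1):
        for removed in itertools.combinations(edges, k):
            yield [e for e in edges if e not in removed]


def worst_case_loss(edges, features, budget):
    """Exact inner maximum of the robust objective, by exhaustive enumeration."""
    return max(loss(g, features) for g in perturbation_set(edges, budget))


def random_dropout_loss(edges, features, budget, n_samples, rng):
    """Largest loss among randomly sampled graphs with up to `budget` deletions."""
    edges = sorted(edges)
    losses = []
    for _ in range(n_samples):
        removed = set(rng.sample(edges, rng.randint(0, budget)))
        losses.append(loss([e for e in edges if e not in removed], features))
    return max(losses)


features = {0: 0.2, 1: 1.0, 2: 1.0, 3: -1.0, 4: 0.5, 5: 0.8}
edges = [(0, 1), (0, 2), (0, 3), (1, 2), (3, 4), (4, 5)]
clean = loss(edges, features)
worst = worst_case_loss(edges, features, budget=2)
sampled = random_dropout_loss(edges, features, budget=2, n_samples=3,
                              rng=random.Random(0))
```

Here the worst case removes exactly the two edges that connect the target node to its supportive neighbors. A random draw hits that specific pair only rarely, so the sampled loss can never exceed, and typically underestimates, the worst-case loss.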

Thus, more common is the approach of adversarial training (Xu et al, 2019c;
Feng et al, 2019a; Chen et al, 2020i). Here, we do not randomly sample from the
perturbation set, but in each training iteration we create adversarial examples
$\tilde{\mathcal{G}}$ and subsequently compute the gradient w.r.t. these. As these samples are
expected to lead to a higher loss, the result of the inner max-operation in Eq. 8.6 is
much better approximated. Instead of perturbing the input graph, the work (Jin and
Zhang, 2019) has investigated a robust training scheme which perturbs the latent
embeddings.

It is interesting to note that adversarial training in its standard form requires la- 
beled data since the attack aims to steer towards an incorrect prediction. In the typi- 
cal transductive graph-learning tasks, however, large amounts of unlabeled data are 
available. As a solution, virtual adversarial training has also been investigated (Deng 
et al, 2019; Sun et al, 2020d), operating on the unlabeled data as well. Intuitively, 
it treats the currently obtained predictions on the unperturbed graph as the ground 
truth, making it a kind of self-supervised learning. The predictions on the perturbed 
data should not deviate from the clean predictions, thus enforcing smoothness. 

Using (virtual) adversarial training has empirically shown some improvements
in robustness, but not consistently. In particular, to approximate the max term
in the robust loss of Eq. 8.6 well, we need powerful adversarial attacks, which
are typically costly to compute for graphs (see Section 8.2). Since here attacks need
to be computed in every training iteration, the training process is slowed down
substantially.


Beyond Data Augmentation - Certificate-Based Loss Functions 


At the end of the day, the techniques above perform a costly data augmentation
during training, i.e., they use altered versions of the graph. Besides being
computationally expensive, there is no guarantee that the adversarial examples are indeed good
proxies for the max term in Eq. 8.6. An alternative approach, e.g., followed
by (Zügner and Günnemann, 2019; Bojchevski and Günnemann, 2019), relies on the
idea of certification as discussed previously. Recall that these techniques compute a
lower bound $\hat{m}_{LB}$ on the worst-case margin. If it is positive, the prediction is robust
for this node/graph. Thus, the lower bound itself acts like a robustness loss
$\mathcal{L}_{rob}$, for example instantiated as a hinge loss: $\max(0, \delta - \hat{m}_{LB})$.
If the lower bound is above $\delta$, then the loss is zero; if it is smaller, a penalty occurs.
Combining this loss function with, e.g., the usual cross-entropy loss forces the model not only
to obtain good classification performance but also robustness.
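
A minimal sketch of such a combined objective for a single prediction is given below. The function names, the default margin threshold, and the `trade_off` weight are assumptions for illustration; in particular, the margin lower bound is passed in as a plain number, whereas in actual training it would be a differentiable function of the model weights.

```python
import math


def cross_entropy(probabilities, true_class):
    """Standard cross-entropy for one prediction."""
    return -math.log(probabilities[true_class])


def robust_hinge(margin_lower_bound, delta):
    """Hinge-style robustness loss: zero once the certified bound reaches delta."""
    return max(0.0, delta - margin_lower_bound)


def combined_loss(probabilities, true_class, margin_lower_bound,
                  delta=1.0, trade_off=0.5):
    """Cross-entropy plus a weighted robustness penalty (trade_off is hypothetical)."""
    return (cross_entropy(probabilities, true_class)
            + trade_off * robust_hinge(margin_lower_bound, delta))
```

With these definitions, `robust_hinge(1.5, 1.0)` is zero because the bound already exceeds the threshold, while `robust_hinge(0.25, 1.0)` yields a penalty of 0.75 that pushes training towards a larger certified margin.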

Crucially, $\mathcal{L}_{rob}$ and, thus, the lower bound need to be differentiable since we need
to compute gradients for training. This, indeed, might be challenging since usually
the lower bound itself is still the result of an optimization problem. While in some special cases
the optimization problem is directly differentiable (Bojchevski and Günnemann,
2019), another general idea is to relate to the principle of duality. Recall that the
worst-case margin $\hat{m}$ (or a potential corresponding lower bound $\hat{m}_{LB}$) is the
result of a (primal) minimization problem (see Eq. 8.3). Based on the principle of
duality, the result of the dual maximization problem provides, as required, a lower
bound to this value. Even more, any feasible solution of the dual problem provides
a lower bound on the optimal solution. Thus, we actually do not need to solve the
dual program. Instead, it is sufficient to compute the objective function of the dual at
any single feasible point to obtain an (even lower, thus looser) lower bound; no
optimization is required and computing gradients often becomes straightforward. This
principle of duality has been used in (Zügner and Günnemann, 2019) to perform
robust training in an efficient way.
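
The duality argument can be written as a short chain of inequalities. The symbols $p$, $d$, $\mathcal{X}$, $\Lambda$ and $\lambda_0$ are generic placeholders introduced here for illustration (primal objective, dual objective, primal and dual feasible sets, and an arbitrary dual-feasible point); they are not the notation of the cited works.

```latex
\hat{m} \;=\; \min_{x \in \mathcal{X}} p(x)
\;\;\ge\;\; \max_{\lambda \in \Lambda} d(\lambda)
\;\;\ge\;\; d(\lambda_0)
\qquad \text{for any feasible } \lambda_0 \in \Lambda .
```

The first inequality is weak duality, the second holds because a maximum is at least as large as the value at any single point. Hence $d(\lambda_0) > 0$ already certifies robustness, and if $d(\lambda_0)$ is a differentiable expression in the model weights, it can be plugged into the robustness loss without solving any optimization problem during training.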


8.4.2.2 Further Training Principles 


Robust training is not the only way to obtain ‘better’ GNN weights. In (Tang
et al, 2020b), for example, the idea of transfer learning (besides further architecture
changes; see next section) is exploited. Instead of purely training on a perturbed
target graph, the method adopts clean graphs with artificially injected perturbations
to first learn suitable GNN weights. These weights are later transferred and
fine-tuned to the actual graph at hand. The work (Chen et al, 2020i) exploits smoothing
distillation where one trains on predicted soft labels instead of ground-truth labels
to enhance robustness. The work (Jin et al, 2019b) argues that graph powering
enhances robustness and proposes to minimize the loss not only on the original graph
but on a set of graphs consisting of the different graph powers. Lastly, the authors
of (You et al, 2021) use a contrastive learning framework using different (graph)
data augmentations. Although adversarial robustness is not their focus, they report
increased adversarial robustness against the attacks of (Dai et al, 2018a). In general,
changing the loss function or regularization terms leads to different training, though
the effects on robustness for GNNs are not fully understood yet.


8.4.3 Improving the Graph Neural Networks’ Architecture 


The final category of methods improving robustness is concerned with designing
novel GNN architectures themselves. Architecture engineering is one core component of
neural network research in general, with many advancements in recent years. While
it has traditionally focused on improving prediction performance, the methods’ robustness
becomes a likewise important property, with both being potentially opposing goals.


8.4.3.1 Adaptively Down-Weighting Edges 


Inspired by the idea of graph cleaning as discussed before, a natural idea is to en- 
hance the GNN by mechanisms to reduce the impact of perturbed edges. An obvious 
choice for this are edge attention principles. However, it is a false conclusion to as- 
sume that standard attention-based GNNs like GAT are immediately suitable for 
this task. Indeed, as shown in (Tang et al, 2020b; Zhu et al, 2019a) such models are 
non-robust. The problem is that these models still assume clean data to be given; 
they are not aware that the graph might be perturbed. 

Thus, other attention approaches try to incorporate more information in the
process. In (Tang et al, 2020b) the attention mechanism is enhanced by taking clean
graphs into account for which perturbations have been artificially injected. Since
now ground truth information is available (i.e., which edges are harmful), the
attention can try to learn to down-weight these while retaining the non-perturbed ones.
An alternative idea is used in (Zhu et al, 2019a). Here, the representations of each
node in each layer are no longer vectors but Gaussian distributions.
The authors hypothesize that attacked nodes tend to have large variances and use this
information within the attention scores. Further attention mechanisms considering,
e.g., the model and data uncertainty or the neighboring nodes’ similarity have been
proposed in (Feng et al, 2021; Zhang and Zitnik, 2020).

An alternative to edge attention is to enhance the aggregation used in message
passing. In a GNN message passing step, a node’s embedding is updated by
aggregating over its neighbors’ embeddings. In this regard, adversarially inserted edges
add additional data points to the aggregation and therefore perturb the output of the
message passing step. Aggregation functions such as sum, weighted mean, or the
max operation used in standard GNNs can be arbitrarily distorted by only a single
outlier. Thus, inspired by the principle of robust statistics, the work (Geisler et al,
2020) proposes to replace the usual GNN’s aggregation function with a differentiable
version of the Medoid, a provably robust aggregation operation. The idea of
enhancing the robustness of the aggregation function used during message passing
has further been investigated in (Wang et al, 2020o; Zhang and Lu, 2020).
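
The sensitivity of the mean to a single outlier, and the stability of a medoid-style aggregation, can be checked with a few lines. This sketch uses the plain (hard, non-differentiable) medoid rather than the differentiable variant proposed in the cited work, and the toy embeddings are made up for illustration.

```python
import math


def mean_aggregate(embeddings):
    """Coordinate-wise mean of the neighbor embeddings."""
    dim = len(embeddings[0])
    return [sum(e[i] for e in embeddings) / len(embeddings) for i in range(dim)]


def medoid_aggregate(embeddings):
    """Return the input embedding with the smallest total distance to all inputs."""
    return min(embeddings, key=lambda c: sum(math.dist(c, e) for e in embeddings))


clean = [[1.0, 1.0], [1.2, 0.9], [0.9, 1.1]]
attacked = clean + [[100.0, -100.0]]  # one adversarially inserted neighbor

mean_after_attack = mean_aggregate(attacked)
medoid_after_attack = medoid_aggregate(attacked)
```

After the attack the mean is dragged far away from every clean embedding, whereas the medoid is still one of the original neighbors. The hard medoid has no useful gradient, which is the motivation for a differentiable relaxation.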

Overall, all these methods down-weight the relevance of edges, with one cru- 
cial difference to the methods discussed in Section 8.4.1: they are adaptive in the 
sense that the relevance of each edge might vary between, e.g., the different lay- 
ers of the GNN. Thus, an edge might be excluded/down-weighted in the first layer 
but included in the second one, depending on the learned intermediate represen- 
tation. This allows a more fine-grained handling of perturbations. In contrast, the 
approaches in Section 8.4.1 derive a single cleaned graph that is used in the entire 
GNN. 


8.4.3.2 Further Approaches 


Many further ideas to improve robustness have been proposed, which do not all
entirely fit into the aforementioned categories. For example, in (Shanthamallu et al,
2021) a surrogate classifier is trained which does not access the graph structure but
is meant to be aligned with the predictions of the GNN, both being jointly trained.
Since the final predictor does not use the graph but only the node’s attributes, higher
robustness to structure perturbations is hypothesized. The work (Miller et al, 2019)
proposes to select the training data in specific ways to increase robustness, and Wu
et al (2020d) uses the principle of information bottleneck, an information-theoretic
approach to learn representations balancing expressiveness and robustness. Finally,
randomized smoothing (Section 8.3.2) can also be interpreted as a technique to
improve adversarial robustness by using an ensemble of predictors on randomized
inputs.


8.4.4 Discussion and Future Directions 


Considering the current state of research, a surprising observation is that robustness
to graph structure perturbations is not well achieved via adversarial training. This
is in stark contrast to, e.g., the image domain where robust training (in the form
of adversarial training) can be considered a highly suitable technique to
improve robustness (Tramer et al, 2020). Focusing on perturbations of the node
attributes, in contrast, robust training indeed performs very well as shown in (Zügner
and Günnemann, 2019). Surprisingly, such robust training (targeting attributes) also
improves robustness under graph structure perturbations (Zügner and Günnemann,
2020) and, even more, outperforms several adversarial training strategies performing
edge dropout. The question remains whether structure perturbations have a special
property that diminishes the effect of robust training or whether the generated adversarial
perturbations do not capture the worst case, showcasing again the hardness of the
problem. This might also explain why the majority of works have focused on
principles of weighting/filtering out edges.

In this regard, it is again important to remember that all approaches are typically
designed with a specific perturbation model $\mathcal{B}(\mathcal{G})$ in mind. Indeed,
down-weighting/filtering edges implicitly assumes that adversarial edges have been added
to the graph. Adversarial edge deletions, in contrast, would require identifying
potential edges to (re)add. This quickly becomes intractable due to the large number of
possible edges and has not been investigated so far. Moreover, only a few methods
so far have provided theoretical guarantees on the methods’ robustness behavior.


8.5 Proper Evaluation in the View of Robustness 


Progress in the field of GNN robustness requires sound evaluation of the proposed 
techniques. Importantly, we have to be aware of the potential trade-off between 
prediction performance (e.g., accuracy) and robustness. For example, we can easily 
obtain a highly robust classification model by simply always predicting the same 
class. Clearly, such a model has no use at all. Thus, the evaluation always involves 
two aspects: (1) Evaluation of the prediction performance. For this, one can simply 
refer to the established evaluation metrics such as accuracy, precision, recall, or 
similar, as known for the various supervised and unsupervised learning tasks. (2) 
Evaluation of the robustness performance. 

Perturbation set and radius. Regarding the latter, the first noteworthy point is that
robustness always links to a specific perturbation set $\mathcal{B}(\cdot)$ that defines the
perturbations the model should be robust to. To enable a proper evaluation, existing works
therefore usually define some parametric form of the perturbation set, e.g., denoted
$\mathcal{B}_r(\mathcal{G})$, where $r$ is the maximal number of changes (the budget) we are
allowed to perform (e.g., the maximal number of edges to add). The variable $r$ is often referred
to as the radius. This is because the budget usually coincides with a certain maximal
norm/distance we are willing to accept between the original graph and the perturbed ones. A
generalization of the above form to consider multiple budgets/radii is straightforward.
Varying the radius enables us to analyze the robustness behavior of the models in
detail. Depending on the radius, different robustness results are expected. Specifically,
for a large radius low robustness is expected (or even desired) and, accordingly,
the evaluation should also include these cases showing the limits of the models.

Recall that using the methods discussed in Section 8.2 and Section 8.3 together, 
we are able to obtain one of the following answers about a prediction’s robustness: 
(R) It is robust; the certificate holds since, e.g., the lower bound on the margin 
is positive. (NR) It is non-robust; we are able to find an adversarial example. (U) 
Unknown; no statement possible since, e.g., the lower bound is negative but the 
attack was not successful either.

Figure 8.6 shows such an example analysis providing insights about the robustness
properties of a GCN in detail. Here, local attacks and certificates are computed
on standard (left) and robustly (right) trained GCNs for the task of node classification.
As the result shows, robust training indeed increases the robustness of a GCN,
with fewer attacks being successful and more nodes being certifiable.

Fig. 8.6: Share of nodes which are provably robust (blue; R), non-robust via
adversarial example construction (orange; NR), or whose robustness is unknown
(“gap”; U), for increasing perturbation radii. For a given radius, the shares satisfy
(R)+(NR)+(U) = 100%. Left: standard training; Right: robust training as
proposed in (Zügner and Günnemann, 2019). Citeseer data and perturbations of node
attributes.
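
The three possible answers (R), (NR), and (U) and their shares at a fixed radius can be computed as in the following sketch. The function names and the input format, one pair of margin lower bound and attack outcome per prediction, are assumptions made for illustration.

```python
from collections import Counter


def robustness_status(margin_lower_bound, attack_succeeded):
    """Combine a certificate and an attack result into R, NR, or U."""
    if attack_succeeded:
        return "NR"  # an adversarial example was found
    if margin_lower_bound > 0:
        return "R"   # the certificate holds
    return "U"       # neither the certificate nor the attack succeeded


def status_shares(results):
    """Shares of R, NR, and U over (margin_lower_bound, attack_succeeded) pairs."""
    counts = Counter(robustness_status(lb, attacked) for lb, attacked in results)
    return {status: counts[status] / len(results) for status in ("R", "NR", "U")}


# Hypothetical results for four predictions at one radius.
results = [(0.4, False), (-0.1, False), (-0.3, True), (0.2, False)]
shares = status_shares(results)
```

For a sound certificate a positive lower bound and a successful attack cannot occur together, so the order of the two checks does not matter in practice. Repeating the computation for increasing radii gives the kind of stacked curves described in the caption of Fig. 8.6.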


It is worth highlighting that case (U), the white gap in Figure 8.6, occurs
only due to the algorithmic inability to solve the attack/certificate problems exactly.
Thus, case (U) does not give a clear indication about the GNN’s robustness but rather
about the performance of the attack/certificate.⁶ Given this set-up, in the following
we distinguish between two evaluation directions, which are reflected in frequently
used measures.

⁶ A large gap indicates that the attacks/certificates are rather loose. The gap might become smaller
when improved attacks/certificates become available. Thus, attacks/certificates themselves can be
evaluated by analyzing the size of the gap since it shows what the maximal possible improvement
in either direction is (e.g., the true share of robust predictions can never exceed 100%-NR for a
specific radius).


Empirical Robustness Evaluation


In an empirical robustness evaluation, we perform an attack on the graph and
observe the effects. Common measures are:


• The drop in performance of the downstream task (e.g., node classification
accuracy), monitoring its decrease after the attack. This metric is typically used in
combination with global attacks where a single perturbation is considered that
aims to jointly change multiple predictions (see Section 8.2.1, Aspect 1).

• The attack success rate, measuring how many predictions were successfully
changed by the attack(s). This simply corresponds to the case (NR), the orange
region shown in Fig. 8.6. This metric is typically used in combination with local
attacks where for each prediction a different perturbation can be used. Naturally,
the local attacks’ success rate is higher than the overall performance drop due
to the flexibility in picking different perturbations.

• In the case of classification, the classification margin, i.e., the predicted
probability of the ‘true’ class minus that of the second-highest
class, and its drop after the attack. See again Figure 8.2 for an example.
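
The three empirical measures above can be computed as in the following sketch. The function names and the toy predictions are assumptions for illustration. The margin is implemented as the probability of the true class minus the highest probability among the remaining classes, which coincides with the second-highest class whenever the prediction is correct and becomes negative for misclassifications.

```python
def accuracy(predictions, labels):
    """Share of predictions that match the ground-truth labels."""
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def attack_success_rate(clean_predictions, attacked_predictions):
    """Share of predictions that were changed by the attack."""
    changed = sum(c != a for c, a in zip(clean_predictions, attacked_predictions))
    return changed / len(clean_predictions)


def classification_margin(probabilities, true_class):
    """Probability of the true class minus the best probability of any other class."""
    best_other = max(p for c, p in enumerate(probabilities) if c != true_class)
    return probabilities[true_class] - best_other


# Hypothetical predictions before and after an attack.
labels = [0, 1, 1, 0]
clean_pred = [0, 1, 1, 1]
attacked_pred = [0, 0, 1, 1]

accuracy_drop = accuracy(clean_pred, labels) - accuracy(attacked_pred, labels)
success_rate = attack_success_rate(clean_pred, attacked_pred)
margin_before = classification_margin([0.5, 0.25, 0.25], true_class=0)
margin_after = classification_margin([0.25, 0.5, 0.25], true_class=0)
```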


The crucial limitation of this evaluation is its dependence on a specific attack 
approach. The power of the attack strongly affects the result. Indeed, it can be re- 
garded as an optimistic evaluation of robustness since a non-successful attack is 
treated as seemingly robust. However, the conclusion is dangerous since a GNN 
might only perform well for one type of attack but not another. Thus, the above 
metrics rather evaluate the power of the attack but only weakly the robustness of the 
model. Interpreting the results has to be done with care. Consequently, when refer- 
ring to empirical robustness evaluation, it is imperative to use multiple different and 
powerful attack approaches. Indeed, as also discussed in (Tramer et al, 2020), each 
robustification principle should come with its own specifically suited attack method 
(also called adaptive attack) to showcase its limitations. 


Provable Robustness Evaluation 


A potentially more suitable direction to analyze the robustness behavior of GNNs 
is to consider provable robustness. As discussed above, case (U) corresponds to un- 
clear predictions for which no robustness statement can be given. Since we care 
about worst-case robustness, we have to assume that these predictions are non- 
robust as well. In short: (NR) and (U) should be rare, while case (R) should dom- 
inate: the number of certifiably robust predictions. Given this idea, the following 
evaluation metrics are often considered: 


• Certified ratio: It corresponds to the number of predictions that can be certified
as robust given a specific radius r in relation to the number of all predictions.
Again, take note of whether for each prediction a different perturbation can be chosen
from $\mathcal{B}_r(\mathcal{G})$ (local) or only a single joint one (global). Clearly, the global
certified ratio is necessarily at least as large as (and often significantly larger than) the
local one.

• Certified correctness: In cases like classification, a prediction can be correct or
incorrect. If it is correct and can be certified as well, the prediction is called
certified correct. The other, highly undesired, extreme are predictions that are
certified incorrect; they are very reliably misclassified.

• Certified performance: Based on the idea of certified correct predictions we can
also derive a certified version of the original performance metrics, e.g., certified
accuracy. Here, only those predictions are treated as correct for the metric if they
are ‘certified correct’. All other predictions, either incorrect or non-certifiable,
are treated as wrong. The certified performance gives a provable lower bound
on the performance of the GNN under any admissible perturbation w.r.t. the
current perturbation set $\mathcal{B}_r(\mathcal{G})$ and the given data.

e Certified radius: While the above metrics assume a fixed ®,(Y), i.e., a fixed 
radius r, we can also take another view. For a specific prediction, the largest 
radius r* for which the prediction can still be certified as robust is called its 
certified radius. Given the certified radius of a single prediction, one can easily 
calculate the average certifiable radius over multiple predictions. 
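The relation between these metrics can be sketched in a few lines. The snippet below is only an illustration under stated assumptions: we assume a (hypothetical) certificate has already returned one certified radius per prediction, with 0 meaning that the prediction cannot be certified for any radius of at least 1, and that certifying a radius implies certifying all smaller ones. The names `radii` and `correct` and the toy values are made up for the example.

```python
def certified_ratio(radii, r):
    """Fraction of predictions whose certified radius is at least r."""
    return sum(rad >= r for rad in radii) / len(radii)


def certified_accuracy(radii, correct, r):
    """Fraction of predictions that are correct AND certified at radius r."""
    return sum(c and rad >= r for rad, c in zip(radii, correct)) / len(radii)


def average_certified_radius(radii):
    """Mean of the per-prediction certified radii."""
    return sum(radii) / len(radii)


# Toy data (assumed): certified radius and correctness of five predictions.
radii = [3, 0, 1, 2, 0]
correct = [True, True, False, True, False]

ratio_r1 = certified_ratio(radii, 1)
acc_r1 = certified_accuracy(radii, correct, 1)
avg_radius = average_certified_radius(radii)
```

By construction, the certified accuracy can never exceed the certified ratio at the same radius, since it counts a subset of the certified predictions.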


Fig. 8.7 Certified ratio of 
Figure 8.7 shows the certified ratio for different GNN architectures for the task
of node classification when perturbing the graph structure. The smoothed classifier
uses 10,000 randomly drawn graphs and the probabilistic certification is based on a
confidence level of α = 0.05 analogously to the set-up in (Geisler et al, 2020). Since
local attacks are considered, the certified ratio is naturally rather low. Still, as shown, 
there is a significant difference between the models’ robustness performance. 

Provable robustness evaluation provides strong guarantees in the sense that the 
evaluation is more pessimistic. E.g. if the certified ratio is high, we know that the 
actual GNN can only be better. Note again, however, that we still also implicitly 
evaluate the certificate; with new certificates the result might become even better. 
Also recall that certificates based on randomized smoothing (Section 8.3.2), eval- 
uate the robustness of the smoothed classifier, thus, not providing guarantees for 
the base classifier itself. Still, a robust prediction of the smoothed classifier entails 
that the base classifier predicts the respective class with a high probability w.r.t. the 
randomization scheme. 

As has become apparent, evaluating robustness is more complex than evaluating
usual prediction performance. To achieve a detailed understanding of the robustness 
properties of GNNs it is thus helpful to analyze all aspects introduced above. 


8.6 Summary 


Along with the increasing relevance of graph neural networks in various application 
domains, comes also an increasing demand to ensure their reliability. In this regard, 
adversarial robustness plays a central role since perturbed data is omnipresent. As
we have seen, standard GNN architectures and training principles — as predomi- 
nantly used in today’s applications — lead to non-robust models, with all the un- 
desired consequences included. However, there is hope: First, various principles to 
improve robustness of GNNs have started to emerge. The obtained results are already
promising, giving a first indication that improved robustness can be achieved
without giving up too much of the GNNs’ prediction performance. Second, robust- 
ness certificates provide us ways to even assess certain robustness properties in a 
formal way. That is, one does not need to rely on heuristics but instead obtains guar- 
antees of the GNN’s behavior. In all these directions, one has just started to explore 
the vast possibilities and many challenges still need to be tackled. Thus, in the up- 
coming years, various further insights can be expected, pursuing one common goal: 
to continue the success story of graph neural networks by enabling their reliable use 
in even sensitive and safety-critical domains. 
Editor’s Notes: Adversarial robustness is one of the hottest topics in machine
learning and deep learning today. This wave of research started from the
robustness of convolutional neural networks in the computer vision domain
and has rapidly influenced other ML/DL network architectures in other
application domains like NLP and graphs. Adversarial robustness of graph
neural networks is a very important research area, which has a fundamental
impact on many other learning tasks, including graph classification
(Chapter 9), link prediction (Chapter 10), graph generation-related tasks
(Chapter 11 and Chapter 12), graph matching networks (Chapter 13), and
so on. Some chapters (like Chapter 14) can be treated as one of the potential ways
to help alleviate the effect of adversarial perturbations by learning a graph
structure beyond its intrinsic graph structure.
Frontiers of Graph Neural Networks

Graph Neural Networks: Graph Classification


Christopher Morris 


Abstract Recently, graph neural networks emerged as the leading machine learn- 
ing architecture for supervised learning with graph and relational input. This chapter 
gives an overview of GNNs for graph classification, i.e., GNNs that learn a graph- 
level output. Since GNNs compute node-level representations, pooling layers, i.e., 
layers that learn graph-level representations from node-level representations, are 
crucial components for successful graph classification. Hence, we give a thorough 
overview of pooling layers. Further, we overview recent research in understanding
GNNs’ limitations for graph classification and progress in overcoming them.
Finally, we survey some graph classification applications of GNNs and overview 
benchmark datasets for empirical evaluation. 


9.1 Introduction 


Graph-structured data is ubiquitous across application domains ranging from chemo- 
and bioinformatics (Barabasi and Oltvai, 2004; Stokes et al, 2020) to image (Si- 
monovsky and Komodakis, 2017) and social network analysis (Easley et al, 2012). 
To develop successful (supervised) machine learning models in these domains, we 
need techniques to exploit the graph structure’s rich information and the feature in- 
formation within nodes and edges. In recent years, numerous approaches have been 
proposed for (supervised) machine learning with graphs—most notably, approaches 
based on graph kernels (Kriege et al, 2020) and, more recently, using graph neural 
networks (GNNs), see (Chami et al, 2020; Wu et al, 2021d) for a general overview. 
Graph kernels work by predefining a fixed set of features, following a two-step 
feature extraction and learning task approach. They first compute a vectorial repre- 
sentation of the graph based on predefined features, e.g., small subgraphs, random 
walks, neighborhood information, or a positive semi-definite kernel matrix reflecting
pairwise graph similarities. The resulting features or the kernel matrix are then
plugged into a learning algorithm such as a Support Vector Machine. Hence, they 
rely on human-made feature engineering. 

GNNs promise better adaptation to the learning task at hand by learning feature
extraction and the downstream task in an end-to-end fashion.
One of the most prominent tasks for GNNs is graph classification or regression, i.e., 
predicting the class labels or target values of a set of graphs, such as properties of 
chemical molecules (Wu et al, 2018). Since GNNs learn vectorial representations of 
nodes, or node-level representations, for successful graph classification, the pooling
layer, i.e., a layer that learns a graph-level representation from node-level representations, is
crucial. This pooling layer aims to learn, based on the node-level representations, 
a vectorial representation that captures the graph structure as a whole. Ideally, one 
wants a graph-level representation that captures local patterns, their interaction, and 
global patterns. However, the optimal representation should adapt to the given data 
distribution. What is more, GNNs for graph classification have recently successfully 
been applied to an extensive range of application areas, the most promising being 
in pharmaceutical drug research; see (Gaudelet et al, 2020) for a survey. Other im- 
portant application areas include fields such as material science (Xie and Grossman, 
2018), process engineering (Schweidtmann et al, 2020), and combinatorial
optimization (Cappart et al, 2021), some of which we also survey here.

In the following, we give an overview of GNNs for graph classification. Starting 
from the mid-nineties’ classic works, we survey modern works from the current 
deep learning era, followed by an in-depth review of recent pooling layers. 

Before GNNs emerged as the leading architecture for graph classification, re- 
search focused on kernel-based algorithms, so-called graph kernels, which work by 
predefining a set of features. Starting from the early 2000s, researchers proposed 
a plethora of graph kernels, based on graph features such as shortest-paths (Borg- 
wardt et al, 2005), random walks (Kang et al, 2012; Sugiyama and Borgwardt, 2015; 
Zhang et al, 20181), local neighborhood information (Shervashidze et al, 2011a;
Costa and De Grave, 2010; Morris et al, 2017, 2020b), and matchings (Frohlich 
et al, 2005; Woźnica et al, 2010; Kriege and Mutzel, 2012; Johansson and Dub- 
hashi, 2015; Kriege et al, 2016; Nikolentzos et al, 2017); see (Kriege et al, 2020; 
Borgwardt et al, 2020) for thorough surveys. For a thorough survey on GNNs, e.g., 
see (Hamilton et al, 2017b; Wu et al, 2021d; Chami et al, 2020). 


9.2 Graph neural networks for graph classification: Classic 
works and modern architectures 


In the following, we survey classic and modern works of GNNs for graph
classification. GNN layers for graph classification date back to at least the mid-nineties
in chemoinformatics. For example, Kireev (1995) derived GNN-like neural ar- 
chitectures to predict chemical molecule properties. The work of (Merkwirth and
Lengauer, 2005) had a similar aim. Gori et al (2005); Scarselli et al (2008) proposed
the original GNN architecture, introducing the general formulation that was later
reintroduced and refined in (Gilmer et al, 2017), who derived the general message-passing
formulation in which most modern GNN architectures can be expressed, see
Section 9.2.1.

We divide our overview of modern GNN layers for graph classification into spa- 
tial approaches, i.e., ones that are purely based on the graph structure by aggre- 
gating local information around each node, and spectral approaches, i.e., ones that 
rely on extracting information from the graph’s spectrum. Although this division is 
somewhat arbitrary, we stick to it due to historical reasons. Due to the large body of 
different GNN layers, we cannot offer a complete survey but focus on representative 
and influential works. 


9.2.1 Spatial approaches 


One of the earliest modern, spatial GNN architectures for graph classification was 
presented in (Duvenaud et al, 2015b), focusing on the prediction of chemical 
molecules’ properties. Specifically, the authors propose to design a differentiable 
variant of the well-known Extended Connectivity Fingerprint (ECFP) (Rogers and 
Hahn, 2010) from chemoinformatics, which works similarly to the computation of the
WL feature vector. For the computation of their GNN layer, denoted Neural Graph
Fingerprints, Duvenaud et al (2015b) first initialize the feature vector $f^0(v)$ of each
node v with features of the corresponding atom, e.g., a one-hot encoding representing
the atom type. In each iteration or layer t, they compute a feature representation
$f^t(v)$ for node v as

$$f^t(v) = f^{t-1}(v) + \sum_{w \in N(v)} f^{t-1}(w),$$


followed by the application of a one-layer perceptron. Here, $N(v)$ denotes the neighborhood
of node v, i.e., $N(v) = \{w \in V \mid (v,w) \in E\}$. Since the ECFP usually computes
sparse feature vectors for small molecules, they apply a linear layer followed
by a softmax function, i.e.,

$$p^t(v) = \mathrm{softmax}\big(f^t(v) \cdot H^t\big),$$

which they interpret as a sparsification layer, where $H^t$ is the parameter matrix of
the linear layer. The final pooled graph-level representation is computed by sum- 
ming over all layers’ features, and the resulting feature is fed into an MLP for the 
downstream regression and classification. The above GNN layer is compared to the 
ECFP on molecular regression datasets showing good performance. 

Dai et al (2016) introduced a simple GNN layer inspired by mean-field inference.
Concretely, given a graph $\mathcal{G}$, the feature $f^t(v)$ for node v at layer t is computed
as

$$f^t(v) = \sigma\Big(f^{t-1}(v) \cdot W_1 + \sum_{w \in N(v)} f^{t-1}(w) \cdot W_2\Big), \tag{9.1}$$

where $W_1$ and $W_2$ are parameter matrices in $\mathbb{R}^{d \times d}$, which are shared across layers,
and $\sigma(\cdot)$ is a component-wise non-linearity. The above layer is evaluated on standard,
small-scale benchmark datasets (Kersting et al, 2016) showing good performance,
similar to classical kernel approaches. Lei et al (2017a) proposed a similar
layer and showed a connection to kernel approaches by deriving the corresponding 
kernel space of the learned graph embeddings. 
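To make the layer of Equation (9.1) concrete, the following is a minimal NumPy sketch in matrix form, where row v of `F` holds the feature of node v and the neighborhood sum becomes a product with the adjacency matrix. The choice of ReLU for the non-linearity, the function name `mean_field_layer`, and the toy graph and weights are assumptions made for illustration only.

```python
import numpy as np


def mean_field_layer(A, F, W1, W2):
    """One layer sigma(F W1 + A F W2), with ReLU standing in for sigma."""
    return np.maximum(F @ W1 + A @ F @ W2, 0.0)


# Path graph 0 - 1 - 2 with two-dimensional node features (toy values).
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
F0 = np.array([[1.0, 0.0],
               [0.0, 1.0],
               [1.0, 1.0]])
W1 = np.eye(2)
W2 = 0.5 * np.eye(2)

F1 = mean_field_layer(A, F0, W1, W2)
```

Because the neighborhood information enters only through a sum, relabeling the nodes simply permutes the rows of the output.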

To explicitly support edge labels, e.g., chemical bonds, Simonovsky and Ko- 
modakis (2017) introduced Edge-Conditioned Convolution, where a feature for 
node v is represented as

$$f^t(v) = \frac{1}{|N(v)|} \sum_{w \in N(v)} F^t\big(l(w,v)\big) \cdot f^{t-1}(w) + b^t.$$

Here $l(w,v)$ is the feature (or label) of the edge shared by the nodes v and w. Moreover,
$F^t\colon \mathbb{R}^s \to \mathbb{R}^{d_t \times d_{t-1}}$ is a function, where s denotes the number of components
of the edge features and $d_t$ and $d_{t-1}$ denote the number of components of the features
of layer t and (t - 1), respectively, mapping the edge feature to a matrix in
$\mathbb{R}^{d_t \times d_{t-1}}$. Further, the function $F^t$ is parameterized by the matrix W, conditioned on
the edge feature l. Finally, $b^t$ is a bias term, again conditioned on the edge feature
l. The above layer is applied to graph classification tasks on small-scale, standard
benchmark datasets (Kersting et al, 2016), and point cloud data from computer
vision.

Building on (Scarselli et al, 2008), Gilmer et al (2017) introduced a general
message-passing framework, unifying most of the GNN architectures proposed so
far. Specifically, Gilmer et al (2017) replaced the inner sum defined over the neighborhood
in the above equations by a general permutation-invariant, differentiable
function, e.g., a neural network, and substituted the outer sum over the previous
and the neighborhood feature representation, e.g., by a column-wise vector concatenation
or LSTM-style update step. Thus, in full generality a new feature $f^t(v)$ is
computed as

$$f^t(v) = f^{W_1}_{\mathrm{merge}}\Big(f^{t-1}(v),\; f^{W_2}_{\mathrm{aggr}}\big(\{\!\!\{ f^{t-1}(w) \mid w \in N(v) \}\!\!\}\big)\Big), \tag{9.2}$$

where $f^{W_2}_{\mathrm{aggr}}$ aggregates over the multi-set of neighborhood features and $f^{W_1}_{\mathrm{merge}}$
merges the node’s representation from step (t - 1) with the computed neighborhood
features. Moreover, it is straightforward to include edge features as well, e.g., by
learning a combined feature representation of the node itself, the neighboring node,
and the corresponding edge feature. Gilmer et al (2017) employed the above architecture
for regression tasks from quantum chemistry, showing promising performance
for regression targets computed by expensive numerical simulations (namely,
DFT) (Wu et al, 2018; Ramakrishnan et al, 2014).
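The structure of Equation (9.2) can be sketched as a single function that takes the aggregation and merge functions as arguments. This is an illustrative sketch, not the implementation of Gilmer et al (2017): the function `message_passing_step`, the parameter-free sum aggregation, and the concatenation merge are assumptions chosen to keep the example small.

```python
import numpy as np


def message_passing_step(neighbors, feats, aggregate, merge):
    """Apply merge(own feature, aggregate(neighbor features)) to every node.

    neighbors: dict mapping node -> list of neighboring nodes
    feats:     dict mapping node -> 1-D feature vector
    """
    return {
        v: merge(feats[v], aggregate([feats[w] for w in neighbors[v]]))
        for v in neighbors
    }


# Path graph 0 - 1 - 2 with scalar features (toy values).
neighbors = {0: [1], 1: [0, 2], 2: [1]}
feats = {0: np.array([1.0]), 1: np.array([2.0]), 2: np.array([3.0])}


def sum_aggr(msgs):
    # Permutation-invariant aggregation over the multiset of messages.
    return np.sum(msgs, axis=0)


def concat_merge(own, agg):
    # Merge by vector concatenation, one of the options named in the text.
    return np.concatenate([own, agg])


new_feats = message_passing_step(neighbors, feats, sum_aggr, concat_merge)
```

Swapping `sum_aggr` for a mean or maximum, or `concat_merge` for a learned update, yields other members of the same family without changing the surrounding loop.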
Concurrently with (Morris et al, 2020b), Xu et al (2019d) investigated the limits
of currently used GNN architectures, showing that their expressiveness is bounded 
by the WL algorithm, a simple heuristic for the graph isomorphism problem. Specif- 
ically, they showed that there does not exist a GNN architecture that can distinguish 
non-isomorphic graphs that the former algorithm cannot. On the positive side, they 
proposed the Graph Isomorphism Network (GIN) layer and showed that there exists 
a parameter initialization such that it is as expressive as the WL algorithm. Formally, 
given a graph $\mathcal{G}$, the feature of node v at layer t is computed as

$$f^t(v) = \mathrm{MLP}\Big((1 + \epsilon) \cdot f^{t-1}(v) + \sum_{w \in N(v)} f^{t-1}(w)\Big), \tag{9.3}$$

where MLP is a standard multi-layer perceptron, and $\epsilon$ is a learnable scalar value.
Xu et al (2019d) used standard sum pooling, see below, and achieved good results
on standard benchmark datasets compared to other standard GNN layers and kernel
approaches (Morris et al, 2020a).
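A minimal NumPy sketch of the GIN update of Equation (9.3) is given below. The two-layer ReLU MLP without bias terms, the function name `gin_layer`, and the fixed toy weights are assumptions for illustration; in practice the weights and the scalar `eps` would be learned.

```python
import numpy as np


def gin_layer(A, F, eps, W_a, W_b):
    """MLP((1 + eps) * F + A F) with a bias-free two-layer ReLU MLP."""
    h = (1.0 + eps) * F + A @ F          # weighted self feature plus neighbor sum
    return np.maximum(h @ W_a, 0.0) @ W_b


# Triangle graph with constant one-dimensional features (toy values).
A = np.array([[0, 1, 1],
              [1, 0, 1],
              [1, 1, 0]], dtype=float)
F0 = np.ones((3, 1))
W_a = np.array([[1.0, -1.0]])
W_b = np.array([[1.0], [1.0]])

F1 = gin_layer(A, F0, eps=0.0, W_a=W_a, W_b=W_b)
```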

Xu et al (2018a) investigated how to combine local information at different dis- 
tances from the target node. Concretely, they investigated different architectural 
design choices for achieving this, e.g., concatenation, max pooling, and LSTM- 
style attention, showing mild performance improvements on standard benchmark 
datasets. Moreover, they drew some connection to random-walk distributions. 

Niepert et al (2016) studied neural architectures for graph classification by ex- 
tracting local patterns from graphs. Starting from each vertex, the approach explores 
the vertex’s k-hop neighborhood, e.g., by using a breadth-first strategy. Using a 
labeling algorithm, e.g., a centrality index, the vertices in this neighborhood are 
ordered and transformed into a fixed-size vector. Afterwards, a CNN-like neural
network followed by an MLP is used to perform the final graph classification. The
approach is compared to graph kernel approaches on standard, small-scale benchmark
datasets (Kersting et al, 2016) showing promising performance. 

Corso et al (2020) investigated the effect and limits of neighborhood aggregation 
functions. They devised aggregation schemes based on multiple aggregators, e.g., 
sum, mean, minimum, maximum, and standard deviation, together with so-called
degree scalers, which combat negative effects due to a different number of neighbors
between nodes. Specifically, they introduced the scaler

$$S(d, \alpha) = \left(\frac{\log(d+1)}{\delta}\right)^{\alpha}, \quad d > 0, \ \alpha \in [-1, 1],$$

where

$$\delta = \frac{1}{|\mathrm{train}|} \sum_{i \in \mathrm{train}} \log(d_i + 1),$$

and $\alpha$ is a variable parameter. Here, the set train contains all nodes i in the training
set and $d_i$ denotes its degree, resulting in the aggregation function

$$\bigoplus = \underbrace{\begin{bmatrix} I \\ S(D, \alpha = 1) \\ S(D, \alpha = -1) \end{bmatrix}}_{\text{scalers}} \otimes \underbrace{\begin{bmatrix} \mu \\ \sigma \\ \max \\ \min \end{bmatrix}}_{\text{aggregators}},$$

where $\otimes$ denotes the tensor product. The authors report promising performance over
standard aggregation functions on a wide range of standard benchmark datasets, 
improving over some standard GNN layers. 
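The combination of scalers and aggregators can be sketched as follows. This is a simplified illustration rather than the authors' implementation: it assumes every node has at least one neighbor, computes the normalization constant over all nodes of a single toy graph as a stand-in for the training set, and uses the hypothetical name `pna_aggregate`.

```python
import numpy as np


def pna_aggregate(A, F, delta):
    """Concatenate 4 aggregators under 3 degree scalers for every node."""
    out = []
    for v in range(A.shape[0]):
        nbrs = F[A[v] > 0]                         # neighbor features, shape (deg, d)
        aggs = np.concatenate([nbrs.mean(axis=0), nbrs.std(axis=0),
                               nbrs.max(axis=0), nbrs.min(axis=0)])
        s = np.log(len(nbrs) + 1) / delta          # scaler S(d, alpha = 1)
        # identity, amplification (alpha = 1), attenuation (alpha = -1)
        out.append(np.concatenate([aggs, s * aggs, aggs / s]))
    return np.stack(out)


# Path graph 0 - 1 - 2 with scalar features (toy values).
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
F = np.array([[1.0], [2.0], [3.0]])
degrees = A.sum(axis=1)
delta = np.mean(np.log(degrees + 1))               # assumed: all nodes act as "train"

out = pna_aggregate(A, F, delta)
```

With d input features per node, each node ends up with 12d aggregated features, which a subsequent learned layer would typically map back to the desired width.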

Vignac et al (2020b) extended the expressivity of GNNs, see also Section 9.4, by
using unique node identifiers, generalizing the message-passing scheme proposed
by (Gilmer et al, 2017), see Equation (9.2), by computing and passing matrix features
instead of vector features. Formally, each node i maintains a matrix $U_i$ in $\mathbb{R}^{n \times c}$,
denoted local context, where the j-th row contains node i's vectorial representation
of node j. At initialization, each local context $U_i$ is set to $\mathbb{1}$ in $\mathbb{R}^{n \times 1}$, where n
denotes the number of nodes in the given graph. Now at each layer l, similar to the
above message-passing framework, the local context is updated as

$$U_i^{(l+1)} = u^{(l)}\big(U_i^{(l)}, \tilde{U}_i^{(l)}\big) \in \mathbb{R}^{n \times c_{l+1}} \quad \text{with} \quad \tilde{U}_i^{(l)} = \phi\Big(\big\{ m^{(l)}(U_i^{(l)}, U_j^{(l)}, y_{ij}) \big\}_{j \in N(i)}\Big),$$

where $u^{(l)}$, $m^{(l)}$, and $\phi$ are update, message, and aggregation functions, respectively,
to compute the updated local context, and $y_{ij}$ denotes the edge features shared by
node i and j. Moreover, the authors study the expressive power, showing that, in
principle, the above layer can distinguish any non-isomorphic pair of graphs and 
propose more scalable alternative variants of the above architecture. Finally, promis- 
ing results on standard benchmark datasets are reported. 


9.2.2 Spectral approaches 


Spectral approaches apply a convolution operator in the spectral domain of the 
graph’s Laplacian matrix, either by directly computing the former’s eigendecom- 
position or by relying on spectral graph theory, see (Chami et al, 2020; Wang et al, 
2018a) for more details. Moreover, they have a solid mathematical foundation stem- 
ming from signal processing, see, e.g., (Sandryhaila and Moura, 2013; Shuman et al, 
2013). 

Formally, let $\mathcal{G}$ be an undirected graph on n nodes with adjacency matrix A. Then
the graph Laplacian of the graph $\mathcal{G}$ is

$$L = I - D^{-1/2} A D^{-1/2},$$

where D is the diagonal matrix of node degrees, i.e., $D_{i,i} = \sum_j A_{i,j}$.
Since the graph Laplacian is symmetric and positive semi-definite, we can factor it as

$$L = U \Lambda U^T,$$

where $U = [u_1, \dots, u_n]$ in $\mathbb{R}^{n \times n}$ denotes the matrix of eigenvectors, sorted according
to their eigenvalues. Further, the matrix $\Lambda$ is a diagonal matrix with $\Lambda_{i,i} = \lambda_i$, where
$\lambda_i$ denotes the ith eigenvalue. Let x in $\mathbb{R}^n$ be a graph signal, i.e., a node feature, then
the graph Fourier transform and its inverse for x are

$$\mathcal{F}(x) = U^T x \quad \text{and} \quad \mathcal{F}^{-1}(\hat{x}) = U \hat{x},$$

respectively, where $\hat{x} = \mathcal{F}(x)$. Hence, formally, the graph Fourier transform is an
orthonormal (linear) transform to the space spanned by the basis of the eigenvectors
in U; consequently, each element $x = \sum_i \hat{x}_i \cdot u_i$.
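As a small numerical illustration of these definitions, the sketch below builds the normalized Laplacian of a toy path graph, factors it with `numpy.linalg.eigh`, and applies the graph Fourier transform and its inverse to a signal. The graph and the signal values are assumptions chosen only for the example.

```python
import numpy as np

# Path graph on four nodes (toy example).
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
d = A.sum(axis=1)
D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
L = np.eye(4) - D_inv_sqrt @ A @ D_inv_sqrt   # normalized graph Laplacian

# eigh returns ascending eigenvalues and orthonormal eigenvectors (columns of U).
lam, U = np.linalg.eigh(L)

x = np.array([1.0, -2.0, 0.5, 3.0])           # a graph signal, one value per node
x_hat = U.T @ x                               # graph Fourier transform
x_rec = U @ x_hat                             # inverse transform recovers x
```

Since U is orthonormal, the inverse transform is simply multiplication by U, and the eigenvalues of the normalized Laplacian lie in the interval [0, 2].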

Based on this observation, spectrum-based methods generalize convolution (e.g.,
on grids) to graphs. To this end, they learn a convolution filter g. Formally, this can be
expressed as follows:

$$x * g = U(U^T x \odot U^T g) = U \cdot \mathrm{diag}(U^T g) \cdot U^T x,$$

where the operator $\odot$ denotes the elementwise product. If we set $g_\theta = \mathrm{diag}(U^T g)$,
the above can be expressed as

$$x * g_\theta = U g_\theta U^T x.$$

Most spectral approaches then differ in their implementation of the operator $g_\theta$.

For example, Spectral Convolutional Neural Networks (Bruna et al, 2014) set
$g_\theta = \Theta^t_{i,j}$, which is a set of learnable parameters. Based on this, they proposed the
following spectral GNN layer:

$$H^t_j = \sigma\Big(\sum_{i=1}^{t-1} U \Theta^t_{i,j} U^T H^{t-1}_i\Big)$$

for j in {1, 2, ..., t}. Here, t is the layer index, $H^{t-1}$ in $\mathbb{R}^{n \times (t-1)}$ is the graph signal,
where $H^0 = X$, i.e., the given graph features, and $\Theta^t_{i,j}$ is a diagonal parameter matrix.
However, the above layer suffers from a number of drawbacks: The basis of the
eigenvectors is not permutation invariant, the layer cannot be applied to a graph with
a different structure, and the computation of the eigendecomposition is cubic in the
number of nodes. Hence, Henaff et al (2015) proposed more scalable variants of
the above layer by building on a smoothness notion in the spectral domain, which
reduces the number of parameters and acts as a regularizer.

To further make the above layer more scalable, Defferrard et al (2016) introduced
Chebyshev Spectral CNNs, which approximate $g_\theta$ by a Chebyshev expansion
(Hammond et al, 2011). Namely, they express

$$g_\theta = \sum_{i=0}^{K} \theta_i T_i(\tilde{\Lambda}),$$

where $\tilde{\Lambda} = 2\Lambda / \lambda_{\max} - I$, and $\lambda_{\max}$ denotes the largest eigenvalue of the normalized
Laplacian $L$. The normalization ensures that the eigenvalues of the Laplacian are
in the [-1, 1] real interval, which is required by Chebyshev polynomials. Here, $T_i$
denotes the ith Chebyshev polynomial with $T_1(x) = x$. Alternatively, Levie et al
(2019) used Cayley polynomials, and showed that Chebyshev Spectral CNNs are a
special case.
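The practical benefit of the Chebyshev expansion is that the filter can be applied without an eigendecomposition, using the standard recurrence $T_0(x) = 1$, $T_1(x) = x$, $T_i(x) = 2x\,T_{i-1}(x) - T_{i-2}(x)$ directly on the rescaled Laplacian. The sketch below illustrates this on a toy cycle graph and compares the result against the explicit spectral form; the function name `chebyshev_filter` and all numerical values are assumptions for illustration.

```python
import numpy as np
from numpy.polynomial import chebyshev


def chebyshev_filter(L, x, theta, lam_max):
    """Compute sum_i theta[i] * T_i(L_tilde) x using only matrix-vector products."""
    L_tilde = 2.0 * L / lam_max - np.eye(L.shape[0])
    t_prev, t_curr = x, L_tilde @ x              # T_0(L_tilde) x and T_1(L_tilde) x
    out = theta[0] * t_prev
    if len(theta) > 1:
        out = out + theta[1] * t_curr
    for th in theta[2:]:
        t_prev, t_curr = t_curr, 2.0 * (L_tilde @ t_curr) - t_prev
        out = out + th * t_curr
    return out


# Cycle on four nodes; every node has degree 2, so L = I - A / 2.
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)
L = np.eye(4) - A / 2.0
lam, U = np.linalg.eigh(L)
theta = [0.5, -0.3, 0.2]
x = np.array([1.0, 0.0, -1.0, 2.0])

y_fast = chebyshev_filter(L, x, theta, lam_max=lam[-1])

# Reference: apply the same polynomial to the rescaled eigenvalues explicitly.
lam_tilde = 2.0 * lam / lam[-1] - 1.0
y_ref = U @ (chebyshev.chebval(lam_tilde, theta) * (U.T @ x))
```

Both routes give the same filtered signal; only the reference route needs the eigenvectors, which is what makes the recurrence attractive for large sparse graphs.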

Kipf and Welling (2017b) proposed to make Chebyshev Spectral CNNs more
scalable by setting

$$x * g_\theta = \theta_0 x - \theta_1 D^{-1/2} A D^{-1/2} x.$$

Further, they improved the generalization ability of the resulting layer by setting
$\theta = \theta_0 = -\theta_1$, resulting in

$$x * g_\theta = \theta (I + D^{-1/2} A D^{-1/2}) x.$$

In fact, the above layer can be understood as a spatial GNN, i.e., it is equivalent to
computing a feature

$$f^t(v) = \sigma\Big(\sum_{w \in N(v) \cup v} \frac{1}{\sqrt{d_v d_w}} f^{t-1}(w) \cdot W\Big),$$

for node v in the given graph $\mathcal{G}$, where $d_v$ and $d_w$ denote the degrees of node v and w,
respectively. Although the above layer was originally proposed for semi-supervised 
node classification, it is now one of the most widely used ones and has been ap- 
plied for tasks such as matrix completion (van den Berg et al, 2018), link predic- 
tion (Schlichtkrull et al, 2018), and also as a baseline for graph classification (Ying 
et al, 2018c). 


9.3 Pooling layers: Learning graph-level outputs from node-level 
outputs 


Since GNNs learn vectorial node representations, using them for graph classification
requires a pooling layer, which maps node-level to graph-level output. Formally, a
pooling layer is a parameterized function that maps a multiset of vectors, i.e., learned
node-level representations, to a single vector, i.e., the graph-level representation.
Arguably, the simplest of such layers are sum, mean, and min or max pooling. That
is, given a graph $\mathcal{G}$ and a multiset

$$M = \{\!\!\{ f(v) \in \mathbb{R}^d \mid v \in V \}\!\!\}$$

of node-level representations of nodes in the graph $\mathcal{G}$, sum pooling computes

$$f_{\mathrm{pool}}(\mathcal{G}) = \sum_{f(v) \in M} f(v),$$
while mean, min, and max pooling take the (component-wise) average, minimum, and
maximum over the elements in M, respectively. These four simple pooling layers are
still used in many published GNN architectures, e.g., see (Duvenaud et al, 2015b). 
In fact, recent work (Mesquita et al, 2020) showed that more sophisticated layers, 
e.g., relying on clustering, see below, do not offer any empirical benefits on many 
real-world datasets, especially those from the molecular domain. 
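The four simple pooling layers amount to component-wise reductions over the rows of the node feature matrix, as the following sketch illustrates. The feature values and the node permutation are toy assumptions.

```python
import numpy as np

# Node-level representations of a graph with three nodes, one row per node.
F = np.array([[1.0, 4.0],
              [3.0, 0.0],
              [2.0, 2.0]])

pooled = {
    "sum": F.sum(axis=0),
    "mean": F.mean(axis=0),
    "min": F.min(axis=0),
    "max": F.max(axis=0),
}

# Reordering the nodes leaves the pooled representation unchanged.
perm = np.array([2, 0, 1])
pooled_perm_sum = F[perm].sum(axis=0)
```

All four reductions ignore the order of the rows, which is what makes them valid functions of a multiset.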


9.3.1 Attention-based pooling layers 


Simple attention-based pooling became popular in recent years due to its easy im- 
plementation and scalability compared to more sophisticated alternatives; see be- 
low. For example, Gilmer et al (2017), see above, used a seq2seq architecture for 
sets (Vinyals et al, 2016) for pooling purposes in their empirical study. Focusing 
on pooling for GNNs, Lee et al (2019b) introduced the SAGPool layer, short for 
Self-Attention Graph Pooling method for GNNs, using self-attention. Specifically, 
they computed a self-attention score by multiplying the aggregated features of an
arbitrary GNN layer by a matrix $\Theta_{\mathrm{att}}$ in $\mathbb{R}^{d \times 1}$, where d denotes the number of components
of the node features. For example, computing the self-attention score Z(v)
for the simple layer of Equation (9.1) equates to

$$Z(v) = \sigma\Big(f^{t-1}(v) \cdot W_1 + \sum_{w \in N(v)} f^{t-1}(w) \cdot W_2\Big) \cdot \Theta_{\mathrm{att}}.$$

The self-attention score Z(v) is subsequently used to select the top-k nodes in the
graph, similarly to Cangea et al (2018) and Gao et al (2018a), see below, omitting
the other nodes, effectively pruning nodes from the graph. Similar attention-based
techniques are proposed in (Huang et al, 2019). 


9.3.2 Cluster-based pooling layers 


The idea of cluster-based pooling layers is to coarsen the graph, i.e., merging similar 
nodes iteratively. One of the earliest uses has been proposed in (Simonovsky and 
Komodakis, 2017), see above, where the Graclus clustering algorithm (Dhillon et al, 
2007) is used. However, one has to note that the algorithm is parameter-free, i.e., it
does not adapt to the learning task at hand.

The arguably most well-known cluster-based pooling layer is DiffPool (Ying 
et al, 2018c). The idea of DiffPool is to iteratively coarsen the graph by learn- 
ing a soft clustering of nodes, making the otherwise discrete clustering assignment 
differentiable. Concretely, at layer t, DiffPool learns a soft cluster assignment S in
$[0, 1]^{n_t \times n_{t+1}}$, where $n_t$ and $n_{t+1}$ are the number of nodes at layer t and (t + 1), respectively.
Each entry $S_{i,j}$ represents the probability of node i of layer t being clustered
into node j of layer (t + 1). In each iteration, the matrix S is computed by

$$S = \mathrm{softmax}(\mathrm{GNN}(A_t, F_t)),$$

where $A_t$ and $F_t$ are the adjacency matrix and the feature matrix of the clustered
graph at layer t, and the function GNN is an arbitrary GNN layer. Finally, in each
layer, the adjacency matrix and the feature matrix are updated as

$$A_{t+1} = S^T A_t S \quad \text{and} \quad F_{t+1} = S^T F_t,$$

respectively.

Empirically, the authors show that the DiffPool layer boosts standard GNN lay- 
ers’ performance, e.g., GraphSage (Hamilton et al, 2017b), on standard, small-scale 
benchmark datasets (Morris et al, 2020a). The downside of the above layer is the 
added computational cost. The adjacency matrix becomes dense and real-valued af- 
ter the first pooling layer, leading to a quadratic cost in the number of nodes for 
each GNN layer’s computation. Moreover, the number of clusters has to be chosen 
in advance, leading to an increase in hyperparameters. 
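One coarsening step of this kind can be sketched in NumPy as follows. This is a simplified illustration, not the DiffPool implementation: the assignment network is reduced to a single linear propagation step, the weights are random rather than learned, and the names `softmax_rows` and `diffpool_step` are hypothetical.

```python
import numpy as np


def softmax_rows(Z):
    """Row-wise softmax, so that each node's cluster probabilities sum to one."""
    Z = Z - Z.max(axis=1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)


def diffpool_step(A, F, W_assign):
    """Coarsen (A, F) via a soft assignment S = softmax(A F W_assign)."""
    S = softmax_rows(A @ F @ W_assign)      # shape (n_t, n_{t+1})
    return S.T @ A @ S, S.T @ F, S


# Path graph on four nodes, one-hot features, two target clusters (toy setup).
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
F = np.eye(4)
rng = np.random.default_rng(0)
W_assign = rng.normal(size=(4, 2))          # assumed: random instead of learned

A_next, F_next, S = diffpool_step(A, F, W_assign)
```

The coarsened adjacency matrix is in general dense and real-valued even for a sparse input graph, which is the source of the quadratic cost mentioned above.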


9.3.3 Other pooling layers 


Zhang et al (2018g) proposed a pooling layer based on differentiable sorting, denoted
SortPooling. That is, given the feature matrix $F_t$ of row-wise node features
after layer t, SortPooling sorts the rows of $F_t$ in a descending fashion. It truncates
the last n - k rows of $F_t$ or pads with zero rows if n < k for a given graph to unify
the graphs’ size. Formally, the layer can be written down as

$$F = \mathrm{sort}(F_t) \quad \text{followed by} \quad F_{\mathrm{trunc}} = \mathrm{truncate}(F, k),$$

where the function sort sorts the feature matrix $F_t$ row-wise in a descending fashion,
and the function truncate returns the first k rows of the input matrix. Ties are broken
using the features from previous layers, 1 to (t - 1). The resulting tensor $F_{\mathrm{trunc}}$ of
shape $k \times \sum_{i=1}^{h} d_i$, where $d_i$ denotes the number of features of the ith layer and h
the total number of layers, is reshaped into a tensor of size $k(\sum_{i=1}^{h} d_i) \times 1$, row-wise,
followed by a standard 1-D convolution with a filter and step size of $\sum_{i=1}^{h} d_i$.
Finally, a sequence of max-pooling and 1-D convolutions is applied to identify
local patterns in the sequence.
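The sort-and-truncate step can be sketched as below. As a simplifying assumption, rows are ordered by their last feature channel only and the tie-breaking via earlier layers is omitted; the function name `sort_pool` and the toy features are hypothetical.

```python
import numpy as np


def sort_pool(F, k):
    """Sort rows by the last channel (descending), then keep or pad to k rows."""
    order = np.argsort(-F[:, -1], kind="stable")
    F_sorted = F[order]
    n, d = F_sorted.shape
    if n >= k:
        return F_sorted[:k]                               # truncate the last n - k rows
    return np.vstack([F_sorted, np.zeros((k - n, d))])    # pad with zero rows


F = np.array([[0.1, 2.0],
              [0.5, 3.0],
              [0.9, 1.0]])

F_trunc = sort_pool(F, 2)    # graph larger than k: truncated
F_padded = sort_pool(F, 4)   # graph smaller than k: zero-padded
```

Either way the output has exactly k rows, so graphs of different sizes can be fed to the same 1-D convolution.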

Similarly, to combat the high computational cost of some pooling layers, e.g.,
DiffPool, Cangea et al (2018) introduced a pooling layer dropping $n - \lceil nk \rceil$ nodes
of a graph with n nodes in each layer for k in [0, 1). The nodes to be dropped are
chosen according to a projection score against a learnable vector p. Concretely, they
compute the score vector

$$y = \frac{F_t \cdot p}{\|p\|} \quad \text{and} \quad I = \text{top-}k(y, k),$$

where top-k returns the indices of the top-scoring nodes according to y. Finally, the
adjacency matrix $A_{t+1}$ is updated by removing rows and columns that are not in I, while
the updated feature matrix is

$$F_{t+1} = (F_t \odot \tanh(y))_I.$$


The authors report slightly lower classification accuracies than the DiffPool layer 
on most employed datasets while being much faster in computation time. A similar 
approach was presented in (Gao and Ji, 2019). 
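A minimal sketch of this projection-based node dropping is shown below. The function name `topk_pool`, the parameter name `ratio` (playing the role of k in the text), and the fixed projection vector are assumptions for illustration; in practice the vector p is learned.

```python
import numpy as np


def topk_pool(A, F, p, ratio):
    """Keep the ceil(ratio * n) nodes with the largest projection score."""
    y = F @ p / np.linalg.norm(p)              # projection score per node
    k = int(np.ceil(ratio * A.shape[0]))
    idx = np.sort(np.argsort(-y)[:k])          # kept nodes, in original order
    F_next = (F * np.tanh(y)[:, None])[idx]    # gate features by tanh(score)
    A_next = A[np.ix_(idx, idx)]               # induced subgraph on kept nodes
    return A_next, F_next, idx


# Path graph on four nodes with two-dimensional features (toy values).
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
F = np.array([[1.0, 0.0],
              [0.0, 2.0],
              [3.0, 0.0],
              [0.0, -1.0]])
p = np.array([1.0, 1.0])

A_next, F_next, idx = topk_pool(A, F, p, ratio=0.5)
```

The multiplication by the gating term is what lets gradients reach p, since the index selection itself is not differentiable.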

To derive more expressive graph representations, Murphy et al (2019c,b) propose
relational pooling. To increase the expressive power of GNN layers, they average
over all permutations of a given graph. Formally, let $\mathcal{G}$ be a graph, then a representation

$$f(\mathcal{G}) = \frac{1}{|\Pi|} \sum_{\pi \in \Pi} g\big(A_{\pi,\pi}, [F_\pi, I_{|V|}]\big) \tag{9.4}$$

is learned, where $\Pi$ denotes all possible permutations of the rows and columns of
the adjacency matrix of $\mathcal{G}$, g is a permutation-invariant function, and [·,·] denotes
column-wise matrix concatenation. Moreover, $A_{\pi,\pi}$ permutes the rows and columns
of the adjacency matrix A according to the permutation $\pi$ in $\Pi$, similarly $F_\pi$ permutes
the rows of the feature matrix F. The authors showed that the above architecture
is more expressive in terms of distinguishing non-isomorphic graphs than the WL
algorithm, and proposed sampling-based techniques to speed up the computation.
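The averaging over permutations can be illustrated by brute force on a very small graph. In the sketch below, the inner function `g` is a deliberately order-dependent toy function (it only looks at the first row), chosen as an assumption to show that the average is nevertheless independent of the node ordering; the names `relational_pool` and `g` are hypothetical, and enumerating all n! permutations is only feasible for tiny n.

```python
import itertools

import numpy as np


def relational_pool(A, F, g):
    """Average g over all node orderings, appending one-hot node identifiers."""
    n = A.shape[0]
    perms = list(itertools.permutations(range(n)))
    total = 0.0
    for perm in perms:
        idx = np.array(perm)
        A_p = A[np.ix_(idx, idx)]                  # permute rows and columns of A
        F_p = np.hstack([F[idx], np.eye(n)])       # permuted features plus identifiers
        total = total + g(A_p, F_p)
    return total / len(perms)


def g(A_p, F_p):
    # Toy inner function: depends on which node happens to come first.
    return float((A_p @ F_p)[0].sum())


# Path graph on four nodes with scalar features (toy values).
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
F = np.array([[1.0], [2.0], [3.0], [4.0]])

out_original = relational_pool(A, F, g)

# The same graph with relabeled nodes yields the same pooled value.
relabel = np.array([2, 0, 3, 1])
out_relabeled = relational_pool(A[np.ix_(relabel, relabel)], F[relabel], g)
```

In practice the sum over all permutations is replaced by a sampled estimate, in line with the sampling-based techniques mentioned above.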

Bianchi et al (2020) introduced a pooling layer based on spectral clustering (von Luxburg,
2007). To this end, they train a GNN together with an MLP, followed by
a softmax function, against an approximation of a relaxed version of the k-way 
normalized Min-cut problem (Shi and Malik, 2000). The resulting cluster assign- 
ment matrix S is used in the same way as in Section 9.3.2. The authors evaluated 
their approach on standard, small-scale benchmark datasets showing promising per- 
formance, especially over the DiffPool layer. For another pooling layer based on 
spectral clustering, see (Ma et al, 2019d). 


9.4 Limitations of graph neural networks and higher-order 
layers for graph classification 


In the following, we briefly survey the limitations of GNNs and how their expressive 
power is upper-bounded by the Weisfeiler-Leman method (Weisfeiler and Leman, 
1968; Weisfeiler, 1976; Grohe, 2017). Concretely, a recent line of works by Morris 
et al (2020b); Xu et al (2019d); Maron et al (2019a) connects the power or expressiv- 
ity of GNNs to that of the WL algorithm. The results show that GNN architectures 
generally do not have more power to distinguish between non-isomorphic graphs 
than the WL algorithm. That is, for any graph structure that the WL algorithm cannot
distinguish, any possible GNN with any possible choices of parameters will also not
be able to distinguish it. On the positive side, the second result states that there is a 
sequence of parameter initializations such that GNNs have the same power in distin- 
guishing non-isomorphic (sub-)graphs as the WL algorithm, see also Equation (9.3). 
However, the WL algorithm has many short-comings, see (Arvind et al, 2015; Kiefer 
et al, 2015), e.g., it cannot distinguish between cycles of different lengths, an impor- 
tant property for chemical molecules, and is not able to distinguish between graphs 
with different triangle counts, an important property of social networks. 
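The triangle-count limitation can be checked on a small example. The following is a
minimal sketch of our own (not code from the cited works): it runs 1-WL color
refinement with a uniform initial coloring on a 6-cycle and on two disjoint
triangles. The helper names `wl_histogram` and `count_triangles` are hypothetical,
and colors are kept as nested tuples rather than compressed integers, which is only
practical for a few rounds on tiny graphs.

```python
from collections import Counter


def wl_histogram(adj, rounds=3):
    """Return the multiset of 1-WL colors after `rounds` refinement steps."""
    colors = {v: () for v in adj}  # uniform initial coloring
    for _ in range(rounds):
        # New color = old color plus the sorted multiset of neighbor colors.
        colors = {
            v: (colors[v], tuple(sorted(colors[u] for u in adj[v])))
            for v in adj
        }
    return Counter(colors.values())


def count_triangles(adj):
    """Count triangles a < b < c whose three edges are all present."""
    return sum(
        1
        for a in adj
        for b in adj[a]
        for c in adj[b]
        if a < b < c and c in adj[a]
    )


six_cycle = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
two_triangles = {0: [1, 2], 1: [0, 2], 2: [0, 1], 3: [4, 5], 4: [3, 5], 5: [3, 4]}

same_colors = wl_histogram(six_cycle) == wl_histogram(two_triangles)
print(same_colors)  # True: both graphs are 2-regular on six nodes
print(count_triangles(six_cycle), count_triangles(two_triangles))  # 0 2
```

Both graphs receive the same color histogram even though their triangle counts
differ, so any method bounded by the WL algorithm assigns them the same
representation.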

To address this, many recent works have tried to build provably more expressive GNNs
for graph classification. For example, in (Morris et al, 2020b; Maron et al, 2019b,
2018) the authors proposed higher-order GNN architectures that have the same
expressive power as the k-dimensional Weisfeiler-Leman algorithm (k-WL), which is,
as k grows, a more expressive generalization of the WL algorithm. In the following,
we give an overview of such works.


9.4.1 Overcoming limitations 


The first GNN architecture that overcame the limitations of the WL algorithm was
proposed in (Morris et al, 2020b). Specifically, they introduced so-called k-GNNs,
which work by learning features over the set of subgraphs on k nodes instead of
vertices by defining a notion of neighborhood between these subgraphs. Formally,
for a given $k$, they consider all $k$-element subsets $[\mathcal{V}]^k$ over
$\mathcal{V}$. Let $s = \{s_1, \dots, s_k\}$ be a $k$-set in $[\mathcal{V}]^k$, then
they define the neighborhood of $s$ as

$$ N(s) = \{ t \in [\mathcal{V}]^k \mid |s \cap t| = k-1 \}. $$

The local neighborhood $N_L(s)$ consists of all $t$ in $N(s)$ such that
$(v,w) \in \mathcal{E}$ for the unique $v \in s \setminus t$ and the unique
$w \in t \setminus s$. The global neighborhood $N_G(s)$ is then defined as
$N(s) \setminus N_L(s)$.
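The neighborhood definitions above can be made concrete with a short sketch. It
enumerates all k-sets by brute force, which is only feasible for very small graphs,
and it assumes an undirected graph given as an edge list. The function name
`k_set_neighborhoods` is our own and does not come from the cited work.

```python
from itertools import combinations


def k_set_neighborhoods(nodes, edges, s):
    """Split N(s) into the local part N_L(s) and the global part N_G(s)."""
    s = frozenset(s)
    k = len(s)
    edge_set = {frozenset(e) for e in edges}  # undirected edges
    local, global_ = [], []
    for t in map(frozenset, combinations(sorted(nodes), k)):
        if len(s & t) != k - 1:
            continue  # t is not in N(s)
        (v,) = s - t  # the unique node of s missing from t
        (w,) = t - s  # the unique node of t missing from s
        (local if frozenset((v, w)) in edge_set else global_).append(t)
    return local, global_


# Path graph 0 - 1 - 2 - 3 with k = 2 and s = {0, 1}.
local, global_ = k_set_neighborhoods(range(4), [(0, 1), (1, 2), (2, 3)], {0, 1})
print(local)    # [frozenset({0, 2})]
print(global_)  # the three remaining 2-sets that share one node with s
```

Here only {0, 2} is a local neighbor of {0, 1}, because exchanging node 1 for node 2
follows the edge (1, 2); the other neighbors exchange nodes that are not adjacent.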

Based on this neighborhood definition, one can generalize most GNN layers for
vertex embeddings to more expressive subgraph embeddings. Given a graph
$\mathcal{G}$, a feature for a subgraph $s$ can be computed as

$$ f^{(t)}(s) = \sigma\Big( f^{(t-1)}(s)\, W_1 + \sum_{u \in N_L(s) \cup N_G(s)} f^{(t-1)}(u)\, W_2 \Big). \tag{9.5} $$


In their experiments, the authors resort to summing only over the local neighborhood
for better scalability and generalization, showing a significant boost over standard
GNNs on a quantum chemistry benchmark dataset (Wu et al, 2018; Ramakrishnan et al, 2014).

The latter approach was refined in (Maron et al, 2019a) and (Morris et al, 2019). 
Specifically, based on (Maron et al, 2018), Maron et al (2019a) derived an architec- 
ture based on standard matrix multiplication that has at least the same power as the 
3-WL. Morris et al (2019) proposed a variant of the k-WL that, unlike the original
algorithm, takes the sparsity of the underlying graph into account. Moreover, they
showed that the derived sparse variant is slightly more powerful than the k-WL in 
distinguishing non-isomorphic graphs and proposed a neural architecture with the 
same power as the sparse k-WL variant. 

An important direction in studying graph representations' expressive power was
taken by (Chen et al, 2019f). The authors prove that a graph representation can
approximate a function $f$ if and only if it can distinguish all pairs of
non-isomorphic graphs $\mathcal{G}$ and $\mathcal{H}$ where
$f(\mathcal{G}) \neq f(\mathcal{H})$. With that in mind, they established an
equivalence between the set of pairs of graphs a representation can distinguish and
the space of functions it can approximate, further introducing a variation of the
2-WL.

Bouritsas et al (2020) enhanced the expressivity of GNNs by annotating node
features with subgraph information. Specifically, by fixing a set of predefined, small
subgraphs, they annotated each node with its role, formally its automorphism
type, in these subgraphs, showing promising performance gains on standard
benchmark datasets for graph classification.

Beaini et al (2020) studied how to incorporate directional information into GNNs.
Finally, You et al (2021) enhanced GNNs by uniquely coloring central vertices and
used two types of message functions to surpass the expressive power of the 1-WL,
while Sato et al (2021) and Abboud et al (2020) used random features to achieve
the same goal and additionally studied the universality properties of their derived
architectures.


9.5 Applications of graph neural networks for graph 
classification 


In the following, we highlight some application areas of GNNs for graph
classification, focusing on the molecular domain. One of the most promising
applications of GNNs for graph classification is pharmaceutical drug research, see
(Gaudelet et al, 2020) for an overview. In this direction, a promising approach was
proposed by (Stokes et al, 2020). They used a form of directed message passing neural
networks operating on molecular graphs to identify repurposing candidates for
antibiotic development. Moreover, they validated their predictions in vivo, proposing
suitable repurposing candidates different from known ones.

Schweidtmann et al (2020) used 2-GNNs, see Equation (9.5), to derive GNN
models for predicting three fuel ignition quality indicators, namely the derived
cetane number, the research octane number, and the motor octane number, of oxygenated
and non-oxygenated hydrocarbons, indicating that the higher-order layers of
Equation (9.5) provide significant gains over standard GNNs in the molecular learning
domain.

A general principled GNN for the molecular domain, denoted DimeNet, was
introduced by (Klicpera et al, 2020). By using an edge-based architecture, they
computed a message coefficient between atoms based on their relative positioning in 3D
space. Concretely, an incoming message to a node is based on the sender's incoming
message as well as the distance between the atoms and the angles of their atomic
bonds. By using this additional information, the authors report significant
improvements over state-of-the-art GNN models in molecular property prediction tasks.


9.6 Benchmark Datasets 


Since most developments for GNNs are driven empirically, i.e., based on
evaluations on standard benchmark datasets, meaningful benchmark datasets are crucial
for the development of GNNs in the context of graph classification. Hence, the
research community has established several widely used repositories for benchmark
datasets for graph classification. Two such repositories are worth highlighting
here. First, the TUDataset (Morris et al, 2020a) collection contains over 130 datasets,
provided at www.graphlearning.io, of various sizes and from various areas such as
chemistry, biology, and social networks. It also provides Python-based data
loaders and baseline implementations of standard graph kernels and GNNs. Moreover,
the datasets can be easily accessed from well-known GNN implementation
frameworks such as Deep Graph Library (Wang et al, 2019f), PyTorch Geometric
(Fey and Lenssen, 2019), or Spektral (Grattarola and Alippi, 2020). Second,
the OGB (Weihua Hu, 2020) collection contains many large-scale graph classification
benchmark datasets, e.g., from chemistry and code analysis, with data loaders,
prespecified splits, and evaluation protocols. Finally, Wu et al (2018) also provide
many large-scale datasets from chemo- and bioinformatics.


9.7 Summary 


We gave an overview of GNNs for graph classification. We surveyed classical and 
modern works in this area, distinguishing between spatial and spectral approaches. 
Since GNNs compute node-level representations, pooling layers for learning
graph-level representations are crucial for successful graph classification. Hence, we
surveyed pooling layers based on attention, clustering, and other approaches to
pooling. Moreover, we gave an overview of the limitations of GNNs for graph
classification and surveyed architectures to overcome these limitations. Finally, we
gave an overview of applications of GNNs and benchmark datasets for their evaluation.

Editor's Notes: The success of GNNs in classification tasks owes much to
advanced representation learning (chapter 2) and to the expressive power of
GNNs (chapter 5). Their performance is limited by the scalability (chapter 6),
robustness (chapter 8), and transformation capability (chapter 12) of the
algorithms. As one of the most prominent tasks, classification appears in a
variety of GNN topics. For example, node classification helps to evaluate the
performance of AutoML (chapter 17) and self-supervised learning (chapter 18)
methods for GNNs, and graph classification can be taken as a subpart of
adversarial learning in graph generation (chapter 11). Further, there are
many promising applications of GNNs in classification, both node- or edge-based
ones such as urban intelligence (chapter 27) and graph-based ones such as protein
and drug prediction (chapter 25).


Graph Neural Networks: Link Prediction


Muhan Zhang 


Abstract Link prediction is an important application of graph neural networks. By 
predicting missing or future links between pairs of nodes, link prediction is widely 
used in social networks, citation networks, biological networks, recommender sys- 
tems, and security, etc. Traditional link prediction methods rely on heuristic node 
similarity scores, latent embeddings of nodes, or explicit node features. Graph neu- 
ral network (GNN), as a powerful tool for jointly learning from graph structure and 
node/edge features, has gradually shown its advantages over traditional methods for 
link prediction. In this chapter, we discuss GNNs for link prediction. We first in- 
troduce the link prediction problem and review traditional link prediction methods. 
Then, we introduce two popular GNN-based link prediction paradigms, node-based 
and subgraph-based approaches, and discuss their differences in link representation 
power. Finally, we review recent theoretical advancements on GNN-based link pre- 
diction and provide several future directions. 


10.1 Introduction 


Link prediction is the problem of predicting the existence of a link between two 
nodes in a network (Liben-Nowell and Kleinberg, 2007). Given the ubiquitous ex- 
istence of networks, it has many applications such as friend recommendation in 
social networks (Adamic and Adar, 2003), co-authorship prediction in citation net- 
works (Shibata et al, 2012), movie recommendation in Netflix (Bennett et al, 2007), 
protein interaction prediction in biological networks (Qi et al, 2006), drug response 
prediction (Stanfield et al, 2017), metabolic network reconstruction (Oyetunde et al, 
2017), hidden terrorist group identification (Al Hasan and Zaki, 2011), knowledge 
graph completion (Nickel et al, 2016a), etc.

Link prediction has many names in different application domains. The term “link
prediction” often refers to predicting links in homogeneous graphs, where nodes and 
links both only have a single type. This is the simplest setting and most link predic- 
tion works focus on this setting. Link prediction in bipartite user-item networks is 
referred to as matrix completion or recommender systems, where nodes have two 
types (user and item) and links can have multiple types corresponding to different 
ratings users can give to items. Link prediction in knowledge graphs is often re- 
ferred to as knowledge graph completion, where each node is a distinct entity and 
links have multiple types corresponding to different relations between entities. In 
most cases, a link prediction algorithm designed for the homogeneous graph setting 
can be easily generalized to heterogeneous graphs (e.g., bipartite graphs and knowl- 
edge graphs) by considering heterogeneous node type and relation type information. 

There are mainly three types of traditional link prediction methods: heuris- 
tic methods, latent-feature methods, and content-based methods. Heuristic meth- 
ods compute heuristic node similarity scores as the likelihood of links (Liben- 
Nowell and Kleinberg, 2007). Popular ones include common neighbors (Liben- 
Nowell and Kleinberg, 2007), Adamic-Adar (Adamic and Adar, 2003), preferen- 
tial attachment (Barabasi and Albert, 1999), and Katz index (Katz, 1953). Latent- 
feature methods factorize the matrix representations of a network to learn low- 
dimensional latent representations/embeddings of nodes. Popular network embed- 
ding techniques such as DeepWalk (Perozzi et al, 2014), LINE (Tang et al, 2015b) 
and node2vec (Grover and Leskovec, 2016), are also latent-feature methods because 
they implicitly factorize some matrix representations of networks too (Qiu et al, 
2018). Both heuristic methods and latent-feature methods infer future/missing links 
leveraging the existing network topology. Content-based methods, on the contrary, 
leverage explicit node attributes/features rather than the graph structure (Lops et al, 
2011). It is shown that combining the graph topology with explicit node features 
can improve the link prediction performance (Zhao et al, 2017). 

By learning from graph topology and node/edge features in a unified way, graph
neural networks (GNNs) have recently shown better link prediction performance than
traditional methods (Kipf and Welling, 2016; Zhang and Chen, 2018b; You et al,
2019; Chami et al, 2019; Li et al, 2020e). There are two popular GNN-based link 
prediction paradigms: node-based and subgraph-based. Node-based methods first 
learn a node representation through a GNN, and then aggregate the pairwise node 
representations as link representations for link prediction. An example is (Varia- 
tional) Graph AutoEncoder (Kipf and Welling, 2016). Subgraph-based methods first 
extract a local subgraph around each target link, and then apply a graph-level GNN 
(with pooling) to each subgraph to learn a subgraph representation, which is used as 
the target link representation for link prediction. An example is SEAL (Zhang and 
Chen, 2018b). We introduce these two types of methods separately in Sections 10.3.1
and 10.3.2, and discuss their expressive power differences in Section 10.3.3.

To understand GNNs' power for link prediction, several theoretical efforts have
been made. The $\gamma$-decaying theory (Zhang and Chen, 2018b) unifies existing link
prediction heuristics into a single framework and proves their local approximability,
which justifies using GNNs to “learn” heuristics from the graph structure instead of
using predefined ones. The theoretical analysis of the labeling trick (Zhang et al,
2020c) proves that subgraph-based approaches have a higher link representation power
than node-based approaches by being able to learn most expressive structural
representations of links (Srinivasan and Ribeiro, 2020b) where node-based approaches
always fail. We introduce these theories in Section 10.4.

Finally, by analyzing limitations of existing methods, we provide several future
directions on GNN-based link prediction in Section 10.5.


10.2 Traditional Link Prediction Methods 


In this section, we review traditional link prediction methods. They can be cate- 
gorized into three classes: heuristic methods, latent-feature methods, and content- 
based methods. 


10.2.1 Heuristic Methods 


Heuristic methods use simple yet effective node similarity scores as the likelihood 
of links (Liben-Nowell and Kleinberg, 2007; Lü and Zhou, 2011). We use $x$ and $y$
to denote the source and target node between which to predict a link. We use
$\Gamma(x)$ to denote the set of $x$'s neighbors.


10.2.1.1 Local Heuristics 


The simplest heuristic is called common neighbors (CN), which counts the number
of neighbors two nodes share as a measurement of their likelihood of having a link:

$$ f_{\mathrm{CN}}(x,y) = |\Gamma(x) \cap \Gamma(y)|. \tag{10.1} $$


CN is widely used in social network friend recommendation. It assumes that the 
more common friends two people have, the more likely they themselves are also 
friends. 

Jaccard score measures the proportion of common neighbors instead: 


$$ f_{\mathrm{Jaccard}}(x,y) = \frac{|\Gamma(x) \cap \Gamma(y)|}{|\Gamma(x) \cup \Gamma(y)|}. \tag{10.2} $$


There is also a famous preferential attachment (PA) heuristic (Barabasi and Al- 
bert, 1999), which uses the product of node degrees to measure the link likelihood: 


$$ f_{\mathrm{PA}}(x,y) = |\Gamma(x)| \cdot |\Gamma(y)|. \tag{10.3} $$

Fig. 10.1: Illustration of three link prediction heuristics: CN, PA and AA.


PA assumes x is more likely to connect to y if y has a high degree. For example, 
in citation networks, a new paper is more likely to cite those papers which already 
have a lot of citations. Networks formed by the PA mechanism are called scale- 
free networks (Barabasi and Albert, 1999), which are important subjects in network 
science. 

Existing heuristics can be categorized based on the maximum hop of neighbors 
needed to calculate the score. CN, Jaccard, and PA are all first-order heuristics, 
because they only involve one-hop neighbors of two target nodes. Next we introduce 
two second-order heuristics. 

The Adamic-Adar (AA) heuristic (Adamic and Adar, 2003) considers the weight 
of common neighbors: 


$$ f_{\mathrm{AA}}(x,y) = \sum_{z \in \Gamma(x) \cap \Gamma(y)} \frac{1}{\log |\Gamma(z)|}, \tag{10.4} $$


where a high-degree common neighbor $z$ is weighted less (down-weighted by the
reciprocal of $\log |\Gamma(z)|$). The assumption is that a high-degree node connecting
to both $x$ and $y$ is less informative than a low-degree node.

Resource allocation (RA) (Zhou et al, 2009) uses a more aggressive down- 
weighting factor: 


$$ f_{\mathrm{RA}}(x,y) = \sum_{z \in \Gamma(x) \cap \Gamma(y)} \frac{1}{|\Gamma(z)|}, \tag{10.5} $$


thus, it favors low-degree common neighbors more. 
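The five local heuristics introduced so far can be computed directly from neighbor
sets. The sketch below is our own illustration on a toy graph. It assumes an
undirected graph stored as a dictionary of neighbor sets without self-loops, and a
node pair with at least one neighbor in total, so that the Jaccard denominator is
nonzero. The function name `local_heuristics` is hypothetical.

```python
import math


def local_heuristics(adj, x, y):
    """Compute CN, Jaccard, PA, AA and RA for the node pair (x, y)."""
    gx, gy = adj[x], adj[y]
    common = gx & gy
    return {
        "CN": len(common),
        "Jaccard": len(common) / len(gx | gy),
        "PA": len(gx) * len(gy),
        # A common neighbor has degree >= 2, so log(degree) > 0.
        "AA": sum(1.0 / math.log(len(adj[z])) for z in common),
        "RA": sum(1.0 / len(adj[z]) for z in common),
    }


# Toy graph: x and y share the neighbors a (degree 2) and b (degree 3).
adj = {
    "x": {"a", "b"},
    "y": {"a", "b"},
    "a": {"x", "y"},
    "b": {"x", "y", "c"},
    "c": {"b"},
}
scores = local_heuristics(adj, "x", "y")
print(scores)
```

In this example the degree-3 neighbor b contributes 1/log 3 to AA but only 1/3 to
RA, which illustrates the stronger down-weighting of RA.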

Both AA and RA are second-order heuristics, as up to two hops of neighbors of x 
and y are required to compute the score. Both first-order and second-order heuristics 
are local heuristics, as they can all be computed from a local subgraph around the 
target link without the need to know the entire network. We illustrate three local 
heuristics, CN, PA, and AA, in Figure 10.1.


10.2.1.2 Global Heuristics


There are also high-order heuristics which require knowing the entire network. 
Examples include Katz index (Katz, 1953), rooted PageRank (RPR) (Brin and Page, 
2012), and SimRank (SR) (Jeh and Widom, 2002). 

Katz index uses a weighted sum of all the walks between x and y where a longer 
walk is discounted more: 


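$$ f_{\mathrm{Katz}}(x,y) = \sum_{l=1}^{\infty} \beta^{l}\, |\mathrm{walks}^{\langle l \rangle}(x,y)|. \tag{10.6} $$

Here $\beta$ is a decaying factor between 0 and 1, and
$|\mathrm{walks}^{\langle l \rangle}(x,y)|$ counts the length-$l$ walks between $x$
and $y$. When we only consider length-2 walks, Katz index reduces to CN (up to the
constant factor $\beta^2$).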
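As a small illustration of Equation (10.6), the following sketch computes a
truncated Katz index by counting walks with dynamic programming over adjacency
lists. Truncating at `max_len` is our simplification: the infinite series only
converges when $\beta$ is smaller than the reciprocal of the largest eigenvalue of
the adjacency matrix. The function name `katz_index` is hypothetical.

```python
def katz_index(adj, x, y, beta, max_len):
    """Truncated Katz index: sum of beta**l * (#length-l walks from x to y)."""
    walks = {v: 0 for v in adj}
    walks[x] = 1  # one walk of length 0, staying at x
    score = 0.0
    for length in range(1, max_len + 1):
        # Extend every walk by one step to obtain the counts for this length.
        extended = {v: 0 for v in adj}
        for u, count in walks.items():
            for v in adj[u]:
                extended[v] += count
        walks = extended
        score += beta**length * walks[y]
    return score


# 4-cycle 0 - 1 - 2 - 3 - 0: nodes 0 and 2 share the neighbors 1 and 3.
cycle = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
beta = 0.1
print(katz_index(cycle, 0, 2, beta, max_len=2))  # beta**2 times CN(0, 2)
print(katz_index(cycle, 0, 2, beta, max_len=4))  # adds the 8 walks of length 4
```

With `max_len=2` the score equals $\beta^2$ times the number of common neighbors,
which matches the remark that Katz restricted to length-2 walks reduces to CN.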

Rooted PageRank (RPR) is a generalization of PageRank. It first computes the
stationary distribution $\pi_x$ of a random walker starting from $x$ who randomly
moves to one of its current neighbors with probability $\alpha$, or returns to $x$
with probability $1 - \alpha$. Then it uses $\pi_x$ at node $y$ (denoted by
$[\pi_x]_y$) to predict link $(x,y)$. When the network is undirected, a symmetric
version of rooted PageRank uses

$$ f_{\mathrm{RPR}}(x,y) = [\pi_x]_y + [\pi_y]_x \tag{10.7} $$


to predict the link. 
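The stationary distribution in Equation (10.7) can be approximated by power
iteration. The sketch below is a minimal version under two assumptions of ours:
every node has at least one neighbor, and a fixed number of iterations is enough
(the error shrinks by a factor of $\alpha$ per step). The names `rooted_pagerank`
and `rpr_score` are hypothetical.

```python
def rooted_pagerank(adj, root, alpha=0.85, iters=200):
    """Approximate the stationary distribution of a walk restarting at `root`."""
    pi = {v: 0.0 for v in adj}
    pi[root] = 1.0
    for _ in range(iters):
        nxt = {v: 0.0 for v in adj}
        for u, mass in pi.items():
            for v in adj[u]:
                # With probability alpha, move to a uniformly chosen neighbor.
                nxt[v] += alpha * mass / len(adj[u])
        # With probability 1 - alpha, jump back to the root (total mass is 1).
        nxt[root] += 1.0 - alpha
        pi = nxt
    return pi


def rpr_score(adj, x, y, alpha=0.85):
    """Symmetric rooted PageRank score of Equation (10.7)."""
    return rooted_pagerank(adj, x, alpha)[y] + rooted_pagerank(adj, y, alpha)[x]


path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(rpr_score(path, 0, 1), rpr_score(path, 0, 3))
```

On this path graph the adjacent pair (0, 1) receives a higher score than the distant
pair (0, 3), as one would expect from a proximity measure.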

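The SimRank (SR) score assumes that two nodes are similar if their neighbors
are also similar. It is defined in a recursive way: if $x = y$, then
$f_{\mathrm{SR}}(x,y) := 1$; otherwise,

$$ f_{\mathrm{SR}}(x,y) := \gamma\, \frac{\sum_{a \in \Gamma(x)} \sum_{b \in \Gamma(y)} f_{\mathrm{SR}}(a,b)}{|\Gamma(x)| \cdot |\Gamma(y)|}, \tag{10.8} $$

where $\gamma$ is a constant between 0 and 1.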
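The recursive definition can be evaluated by fixed-point iteration, starting from
the identity similarity. The following sketch is our own illustration. It assigns
score 0 to pairs involving an isolated node, which is an assumption, since the
definition above leaves that case open. The function name `simrank` is hypothetical.

```python
def simrank(adj, gamma=0.8, iters=10):
    """Iterate the recursive SimRank definition starting from the identity."""
    nodes = list(adj)
    sim = {(a, b): float(a == b) for a in nodes for b in nodes}
    for _ in range(iters):
        new = {}
        for x in nodes:
            for y in nodes:
                if x == y:
                    new[x, y] = 1.0
                elif not adj[x] or not adj[y]:
                    new[x, y] = 0.0  # assumption: isolated nodes score 0
                else:
                    total = sum(sim[a, b] for a in adj[x] for b in adj[y])
                    new[x, y] = gamma * total / (len(adj[x]) * len(adj[y]))
        sim = new
    return sim


# Star graph: the leaves "a" and "b" share their only neighbor "c".
star = {"c": ["a", "b"], "a": ["c"], "b": ["c"]}
sim = simrank(star)
print(sim["a", "b"], sim["a", "c"])
```

The two leaves obtain similarity $\gamma$ because their neighborhoods coincide,
whereas a leaf and the center obtain 0 in this bipartite example.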

High-order heuristics are global heuristics. By computing node similarity from 
the entire network, high-order heuristics often have better performance than first- 
order and second-order heuristics. 


10.2.1.3  Summarization 


We summarize the eight introduced heuristics in Table 10.1. For more variants of the 
above heuristics, please refer to (Liben-Nowell and Kleinberg, 2007; Lü and Zhou, 
2011). Heuristic methods can be regarded as computing predefined graph structure 
features located in the observed node and edge structures of the network. Although 
effective in many domains, these handcrafted graph structure features have limited 
expressivity—they only capture a small subset of all possible structure patterns, and 
cannot express general graph structure features underlying different networks.
Besides, heuristic methods only work well when the network formation mechanism
aligns with the heuristic. There may exist networks with complex formation
mechanisms which no existing heuristics can capture well. Most heuristics only work
for homogeneous graphs.


Table 10.1: Popular heuristics for link prediction 


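Name                     Order   Formula
common neighbors         first   $|\Gamma(x) \cap \Gamma(y)|$
Jaccard                  first   $\frac{|\Gamma(x) \cap \Gamma(y)|}{|\Gamma(x) \cup \Gamma(y)|}$
preferential attachment  first   $|\Gamma(x)| \cdot |\Gamma(y)|$
Adamic-Adar              second  $\sum_{z \in \Gamma(x) \cap \Gamma(y)} \frac{1}{\log |\Gamma(z)|}$
resource allocation      second  $\sum_{z \in \Gamma(x) \cap \Gamma(y)} \frac{1}{|\Gamma(z)|}$
Katz                     high    $\sum_{l=1}^{\infty} \beta^{l} |\mathrm{walks}^{\langle l \rangle}(x,y)|$
rooted PageRank          high    $[\pi_x]_y + [\pi_y]_x$
SimRank                  high    $\gamma \frac{\sum_{a \in \Gamma(x)} \sum_{b \in \Gamma(y)} f_{\mathrm{SR}}(a,b)}{|\Gamma(x)| \cdot |\Gamma(y)|}$


Notes: $\Gamma(x)$ denotes the neighbor set of vertex $x$. $\beta < 1$ is a damping
factor. $|\mathrm{walks}^{\langle l \rangle}(x,y)|$ counts the number of length-$l$
walks between $x$ and $y$. $[\pi_x]_y$ is the stationary distribution probability of
$y$ under the random walk from $x$ with restart, see (Brin and Page, 2012). SimRank
score uses a recursive definition.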


10.2.2 Latent-Feature Methods 


The second class of traditional link prediction methods is called latent-feature meth- 
ods. In some literature, they are also called latent-factor models or embedding meth- 
ods. Latent-feature methods compute latent properties or representations of nodes, 
often obtained by factorizing a specific matrix derived from the network, such as the 
adjacency matrix and the Laplacian matrix. These latent features of nodes are not 
explicitly observable—they must be computed from the network through optimiza- 
tion. Latent features are also not interpretable. That is, unlike explicit node features 
where each feature dimension represents a specific property of nodes, we do not 
know what each latent feature dimension describes. 


10.2.2.1 Matrix Factorization 


One of the most popular latent-feature methods is matrix factorization (Koren et al,
2009; Ahmed et al, 2013), which originated in the recommender systems literature.
Matrix factorization factorizes the observed adjacency matrix $A$ of the network into
the product of a low-rank latent-embedding matrix $Z$ and its transpose. That is, it
approximately reconstructs the edge between $i$ and $j$ using their $k$-dimensional
latent embeddings $z_i$ and $z_j$:

$$ \hat{A}_{i,j} = z_i^\top z_j. \tag{10.9} $$


It then minimizes the mean-squared error between the reconstructed adjacency ma- 
trix and the true adjacency matrix over the observed edges to learn the latent em- 
beddings: 


$$ \mathcal{L} = \frac{1}{|\mathcal{E}|} \sum_{(i,j) \in \mathcal{E}} (A_{i,j} - \hat{A}_{i,j})^2. \tag{10.10} $$


Finally, we can predict new links by the inner product between nodes’ latent em- 
beddings. Variants of matrix factorization include using powers of A (Cangea et al, 
2018) and using general node similarity matrices (Ou et al, 2016) to replace the 
original adjacency matrix A. If we replace A with the Laplacian matrix L and define 
the loss as follows: 


$$ \mathcal{L} = \sum_{(i,j) \in \mathcal{E}} \|z_i - z_j\|_2^2, \tag{10.11} $$

then the nontrivial solution to the above is constructed from the eigenvectors
corresponding to the $k$ smallest nonzero eigenvalues of $L$, which recovers the
Laplacian eigenmap technique (Belkin and Niyogi, 2002) and the solution to spectral
clustering (Von Luxburg, 2007).
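To make the reconstruction objective of Equations (10.9) and (10.10) concrete, the
sketch below fits embeddings by plain stochastic gradient descent on the observed
entries. It is a toy illustration under our own choices of learning rate, epochs,
and initialization, not the procedure of any cited work. The helper names `dot`,
`mse`, and `factorize` are hypothetical.

```python
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def mse(Z, entries):
    """Mean-squared reconstruction error over the observed entries."""
    return sum((a - dot(Z[i], Z[j])) ** 2 for i, j, a in entries) / len(entries)


def factorize(entries, num_nodes, k=2, lr=0.05, epochs=500, seed=0):
    """Fit embeddings Z so that Z[i] . Z[j] approximates each observed A[i][j]."""
    rng = random.Random(seed)
    Z = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(num_nodes)]
    initial = mse(Z, entries)
    for _ in range(epochs):
        for i, j, a in entries:
            err = a - dot(Z[i], Z[j])
            zi, zj = Z[i][:], Z[j][:]  # copies, so both updates use old values
            for d in range(k):
                # Gradient step on (a - z_i . z_j)^2; the factor 2 is folded into lr.
                Z[i][d] += lr * err * zj[d]
                Z[j][d] += lr * err * zi[d]
    return Z, initial, mse(Z, entries)


# Observed edges of the path 0 - 1 - 2 - 3, each with target value 1.
observed = [(0, 1, 1.0), (1, 2, 1.0), (2, 3, 1.0)]
Z, loss_before, loss_after = factorize(observed, num_nodes=4)
print(loss_before, loss_after)
print(dot(Z[0], Z[3]))  # score for the unobserved pair (0, 3)
```

After training, the inner product of any two embeddings, including unobserved pairs
such as (0, 3), can be used as a link score.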


10.2.2.2 Network Embedding 


Network embedding methods have gained great popularity in recent years since 
the pioneering work DeepWalk (Perozzi et al, 2014). These methods learn low- 
dimensional representations (embeddings) for nodes, often based on training a skip- 
gram model (Mikolov et al, 2013a) over random-walk-generated node sequences, 
so that nodes which often appear nearby each other in a random walk (i.e., nodes 
close in the network) will have similar representations. Then, the pairwise node 
embeddings are aggregated as link representations for link prediction. Although 
not explicitly factorizing a matrix, it is shown in (Qiu et al, 2018) that many net- 
work embedding methods, including LINE (Tang et al, 2015b), DeepWalk, and 
node2vec (Grover and Leskovec, 2016), implicitly factorize some matrix representa- 
tions of the network. Thus, they can also be categorized into latent-feature methods. 
For example, DeepWalk approximately factorizes: 


$$ \log\Big( \mathrm{vol}(\mathcal{G}) \Big( \frac{1}{w} \sum_{r=1}^{w} (D^{-1}A)^{r} \Big) D^{-1} \Big) - \log(b), \tag{10.12} $$

where $\mathrm{vol}(\mathcal{G})$ is the sum of node degrees, $D$ is the diagonal
degree matrix, $w$ is skip-gram's window size, and $b$ is a constant. As we can see,
DeepWalk essentially factorizes the log of a sum of powers (up to the $w$-th power) of
the normalized adjacency matrix. To intuitively understand this, we can think of the
random walk as extending a node's neighborhood to $w$ hops away, so that we not only
require direct neighbors to have similar embeddings, but also require nodes reachable
from each other through $w$ steps of random walk to have similar embeddings.

Similarly, the LINE algorithm (Tang et al, 2015b) in its second-order form
implicitly factorizes:

$$ \log\big( \mathrm{vol}(\mathcal{G})\, (D^{-1} A D^{-1}) \big) - \log(b). \tag{10.13} $$


Another popular network embedding method, node2vec, which enhances Deep- 
Walk with negative sampling and biased random walk, is also shown to implicitly 
factorize a matrix. The matrix does not have a closed form due to the use of second- 
order (biased) random walk (Qiu et al, 2018). 


10.2.2.3 Summarization 


We can understand latent-feature methods as extracting low-dimensional node em- 
beddings from the graph structure. Traditional matrix factorization methods use the 
inner product between node embeddings to predict links. However, we are actually 
not restricted to inner product. Instead, we can apply a neural network over an ar- 
bitrary aggregation of pairwise node embeddings to learn link representations. For 
example, node2vec (Grover and Leskovec, 2016) provides four symmetric aggre- 
gation functions (invariant to the order of two nodes): mean, Hadamard product, 
absolute difference, and squared difference. If we predict directed links, we can also 
use non-symmetric aggregation functions, such as concatenation. 
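The aggregation functions named above are simple element-wise operations. The sketch
below implements them on plain Python lists; the function name `aggregate` and the
string keys are our own labels, not an API of node2vec.

```python
def aggregate(zx, zy, how):
    """Combine two node embeddings into one link representation."""
    if how == "mean":
        return [(a + b) / 2 for a, b in zip(zx, zy)]
    if how == "hadamard":
        return [a * b for a, b in zip(zx, zy)]
    if how == "abs_diff":
        return [abs(a - b) for a, b in zip(zx, zy)]
    if how == "sq_diff":
        return [(a - b) ** 2 for a, b in zip(zx, zy)]
    if how == "concat":  # order-sensitive, suitable for directed links
        return list(zx) + list(zy)
    raise ValueError(f"unknown aggregation: {how}")


zx, zy = [1.0, 2.0], [3.0, -1.0]
for how in ("mean", "hadamard", "abs_diff", "sq_diff"):
    # Symmetric functions give the same result for (x, y) and (y, x).
    assert aggregate(zx, zy, how) == aggregate(zy, zx, how)
print(aggregate(zx, zy, "concat"), aggregate(zy, zx, "concat"))
```

The resulting vector would then be fed to a classifier such as an MLP; concatenation
is the only option here that distinguishes the link (x, y) from (y, x).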

Latent-feature methods can take global properties and long-range effects into 
node representations, because all node pairs are used together to optimize a single 
objective function, and the final embedding learned for a node can be influenced 
by all nodes in the same connected component during the optimization. However, 
latent-feature methods cannot capture structural similarities between nodes (Ribeiro
et al, 2017), i.e., two nodes sharing identical neighborhood structures are not
mapped to similar embeddings. Latent-feature methods also need an extremely large
dimension to express some simple heuristics (Nickel et al, 2014), so they
sometimes perform worse than heuristic methods. Finally, latent-feature
methods are transductive learning methods—the learned node embeddings cannot 
generalize to new nodes or new networks. 

There are many latent-feature methods designed for heterogeneous graphs. For 
example, the RESCAL model (Nickel et al, 2011) generalizes matrix factorization 
to multi-relation graphs, which essentially performs a kind of tensor factorization. 
Metapath2vec (Dong et al, 2017) generalizes node2vec to heterogeneous graphs.


10.2.3 Content-Based Methods


Both heuristic methods and latent-feature methods face the cold-start problem. That 
is, when a new node joins the network, heuristic methods and latent-feature meth- 
ods may not be able to predict its links accurately because it has no or only a few 
existing links with other nodes. In this case, content-based methods might help. 
Content-based methods leverage explicit content features associated with nodes for 
link prediction, which have wide applications in recommender systems (Lops et al, 
2011). For example, in citation networks, word distributions can be used as content 
features for papers. In social networks, a user’s profile, such as their demographic in- 
formation and interests, can be used as their content features (however, their friend- 
ship information belongs to graph structure features because it is calculated from the 
graph structure). However, content-based methods usually have worse performance 
than heuristic and latent-feature methods due to not leveraging the graph structure. 
Thus, they are usually used together with the other two types of methods (Koren, 
2008; Rendle, 2010; Zhao et al, 2017) to enhance the link prediction performance. 


10.3 GNN Methods for Link Prediction 


In the last section, we covered three types of traditional link prediction
methods. In this section, we discuss GNN methods for link prediction. GNN
methods combine graph structure features and content features by learning them to- 
gether in a unified way, leveraging the excellent graph representation learning ability 
of GNNs. 

There are mainly two GNN-based link prediction paradigms, node-based and 
subgraph-based. Node-based methods aggregate the pairwise node representations 
learned by a GNN as the link representation. Subgraph-based methods extract a 
local subgraph around each link and use the subgraph representation learned by a 
GNN as the link representation. 


10.3.1 Node-Based Methods 


The most straightforward way of using GNNs for link prediction is to treat GNNs 
as inductive network embedding methods which learn node embeddings from lo- 
cal neighborhood, and then aggregate the pairwise node embeddings of GNNs to 
construct link representations. We call these methods node-based methods.


10.3.1.1 Graph AutoEncoder


The pioneering work of node-based methods is Graph AutoEncoder (GAE) (Kipf
and Welling, 2016). Given the adjacency matrix $A$ and node feature matrix $X$ of a
graph, GAE (Kipf and Welling, 2016) first uses a GCN (Kipf and Welling, 2017b) to
compute a node representation $z_i$ for each node $i$, and then uses
$\sigma(z_i^\top z_j)$ to predict link $(i,j)$:

$$ \hat{A}_{i,j} = \sigma(z_i^\top z_j), \quad \text{where } z_i = Z_{i,:},\; Z = \mathrm{GCN}(X, A), \tag{10.14} $$


where $Z$ is the node representation (embedding) matrix output by the GCN with
the $i$-th row of $Z$ being node $i$'s representation $z_i$, $\hat{A}_{i,j}$ is the
predicted probability for link $(i,j)$, and $\sigma$ is the sigmoid function. If $X$
is not given, GAE can use the one-hot encoding matrix $I$ instead. The model is
trained to minimize the cross entropy between the reconstructed adjacency matrix and
the true adjacency matrix:

$$ \mathcal{L} = \sum_{i \in \mathcal{V}, j \in \mathcal{V}} \big( -A_{i,j} \log \hat{A}_{i,j} - (1 - A_{i,j}) \log(1 - \hat{A}_{i,j}) \big). \tag{10.15} $$


In practice, the loss of positive edges ($A_{i,j} = 1$) is up-weighted by $k$, where
$k$ is the ratio between negative edges ($A_{i,j} = 0$) and positive edges. The
purpose is to balance
the positive and negative edges’ contribution to the loss. Otherwise, the loss might 
be dominated by negative edges due to the sparsity of practical networks. 
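The decoder and the re-weighted loss can be sketched in a few lines. The example
below is a minimal illustration and not the reference implementation: the embedding
matrix `Z` stands in for the GCN output and is simply set to zeros, and the sum runs
over all ordered pairs including the diagonal, as in Equation (10.15). The names
`sigmoid` and `gae_weighted_loss` are hypothetical.

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def gae_weighted_loss(Z, A):
    """Cross entropy of Equation (10.15) with up-weighted positive entries."""
    n = len(A)
    num_pos = sum(A[i][j] for i in range(n) for j in range(n))
    pos_weight = (n * n - num_pos) / num_pos  # ratio of negative to positive entries
    loss = 0.0
    for i in range(n):
        for j in range(n):
            # Inner-product decoder: predicted probability of link (i, j).
            p = sigmoid(sum(a * b for a, b in zip(Z[i], Z[j])))
            if A[i][j] == 1:
                loss -= pos_weight * math.log(p)
            else:
                loss -= math.log(1.0 - p)
    return loss, pos_weight


# Path 0 - 1 - 2; Z stands in for the GCN output and is set to zeros here.
A = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
Z = [[0.0, 0.0] for _ in range(3)]
loss, pos_weight = gae_weighted_loss(Z, A)
print(pos_weight, loss)  # 1.25 and approximately 10 * ln 2
```

With four positive and five negative entries the weight is 1.25, so both groups
contribute a total weight of five to the loss.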


10.3.1.2 Variational Graph AutoEncoder 


The variational version of GAE is called VGAE, or Variational Graph AutoEncoder
(Kipf and Welling, 2016). Rather than learning deterministic node embeddings $z_i$,
VGAE uses two GCNs to learn the mean $\mu_i$ and variance $\sigma_i^2$ of $z_i$,
respectively.

VGAE assumes the adjacency matrix $A$ is generated from the latent node embeddings
$Z$ through $p(A|Z)$, where $Z$ follows a prior distribution $p(Z)$. Similar to GAE,
VGAE uses an inner-product-based link reconstruction model as $p(A|Z)$:

$$ p(A|Z) = \prod_{i \in \mathcal{V}} \prod_{j \in \mathcal{V}} p(A_{i,j} | z_i, z_j), \quad \text{where } p(A_{i,j} = 1 | z_i, z_j) = \sigma(z_i^\top z_j). \tag{10.16} $$

The prior distribution $p(Z)$ is a standard Normal distribution:

$$ p(Z) = \prod_{i \in \mathcal{V}} p(z_i) = \prod_{i \in \mathcal{V}} \mathcal{N}(z_i | 0, I). \tag{10.17} $$


Given p(A|Z) and p(Z), we may compute the posterior distribution of Z using 
Bayes’ rule. However, this distribution is often intractable. Thus, given the adja- 
cency matrix A and node feature matrix X, VGAE uses graph neural networks to 
approximate the posterior distribution of the node embedding matrix $Z$:


$$q(Z|X,A) = \prod_{i \in V} q(z_i | X, A), \quad \text{where } q(z_i | X, A) = \mathcal{N}(z_i | \mu_i, \mathrm{diag}(\sigma_i^2)). \tag{10.18}$$


Here, the mean $\mu_i$ and variance $\sigma_i^2$ of $z_i$ are given by two GCNs. Then, VGAE
maximizes the evidence lower bound to learn the GCN parameters:


$$\mathcal{L} = \mathbb{E}_{q(Z|X,A)}[\log p(A|Z)] - \mathrm{KL}[q(Z|X,A) \,\|\, p(Z)], \tag{10.19}$$


where $\mathrm{KL}[q(Z|X,A) \,\|\, p(Z)]$ is the Kullback-Leibler divergence between the
approximate posterior and the prior distribution of $Z$. The evidence lower bound is
optimized using the reparameterization trick (Kingma and Welling, 2014). Finally, the
embedding means $\mu_i$ and $\mu_j$ are used to predict link $(i, j)$ by $\hat{A}_{i,j} = \sigma(\mu_i^\top \mu_j)$.
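
The two ingredients of the VGAE objective that are easiest to get wrong, the reparameterized sample and the KL term against the standard Normal prior, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the arrays `mu` and `log_var` stand in for the outputs of the two GCNs, and parameterizing the variance through its logarithm is an assumption made here for numerical convenience.

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, I), so that the sample is a
    differentiable function of mu and log_var."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    """KL[N(mu, diag(sigma^2)) || N(0, I)], summed over nodes and dimensions.

    Closed form per dimension: 0.5 * (sigma^2 + mu^2 - 1 - log sigma^2).
    """
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var)
```

The KL term vanishes exactly when every $\mu_i = 0$ and every $\sigma_i^2 = 1$, that is, when the approximate posterior coincides with the prior of Equation (10.17).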


10.3.1.3 Variants of GAE and VGAE 


There are many variants of GAE and VGAE. For example, ARGE (Pan et al, 2018) 
enhances GAE with an adversarial regularization to regularize the node embeddings 
to follow a prior distribution. S-VAE (Davidson et al, 2018) replaces the Normal 
distribution in VGAE with a von Mises-Fisher distribution to model data with a hy- 
perspherical latent structure. MGAE (Wang et al, 2017a) uses a marginalized graph 
autoencoder to reconstruct node features from corrupted ones through a GCN and 
applies it to graph clustering. 

GAE represents a general class of node-based methods, where a GNN is first used 
to learn node embeddings and pairwise node embeddings are aggregated to learn 
link representations. In principle, we can replace the GCN used in GAE/VGAE with 
any GNN, and replace the inner product $z_i^\top z_j$ with any aggregation function over
$\{z_i, z_j\}$ and feed the aggregated link representation to an MLP to predict the link
(i, j). Following this methodology, we can generalize any GNN designed for learn- 
ing node representations to link prediction. For example, HGCN (Chami et al, 2019) 
combines hyperbolic graph convolutional neural networks with a Fermi-Dirac de- 
coder for aggregating pairwise node embeddings and outputting link probabilities: 


$$p(A_{i,j} = 1 | z_i, z_j) = \left[ \exp\left( (d(z_i, z_j) - r) / t \right) + 1 \right]^{-1} \tag{10.20}$$


where $d(\cdot, \cdot)$ computes the hyperbolic distance and $r, t$ are hyperparameters.
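
As a small illustration of the Fermi-Dirac decoder, the sketch below maps a precomputed distance value to a link probability. The function name and the default values of `r` and `t` are hypothetical, and computing the hyperbolic distance itself is left out.

```python
import math

def fermi_dirac(dist, r=2.0, t=1.0):
    """Link probability from a distance value, following Eq. (10.20).

    dist: distance d(z_i, z_j) between two embeddings (assumed precomputed).
    r, t: hyperparameters; the defaults are illustrative only.
    """
    return 1.0 / (math.exp((dist - r) / t) + 1.0)
```

The probability equals 0.5 at `dist == r` and decreases as the distance grows, so `r` acts as a distance threshold and `t` controls how sharp the transition is.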

Position-aware GNN (PGNN) (You et al, 2019) aggregates messages only from 
some selected anchor nodes during message passing to capture position information
of nodes. Then, the inner product between node embeddings is used to predict
links. The PGNN paper also generalizes other GNNs, including GAT (Petar et al, 
2018), GIN (Xu et al, 2019d) and GraphSAGE (Hamilton et al, 2017b), to the link 
prediction setting based on the inner-product decoder. 

Many graph neural networks use link prediction as an objective for training node
embeddings in an unsupervised manner, even though their final task is still node
classification. For example, after computing the node embeddings, GraphSAGE
(Hamilton et al, 2017b) minimizes the following objective for each $z_i$ to encourage
connected or nearby nodes to have similar representations:


$$L(z_i) = -\log \left( \sigma(z_i^\top z_j) \right) - k_n \cdot \mathbb{E}_{j' \sim p_n} \log \left( 1 - \sigma(z_i^\top z_{j'}) \right), \tag{10.21}$$


where $j$ is a node that co-occurs near $i$ on some fixed-length random walk, $p_n$ is the
negative sampling distribution, and $k_n$ is the number of negative samples. If we focus on
length-2 random walks, the above loss reduces to a link prediction objective.
Compared to the GAE loss in Equation (10.15), the above objective does not consider all
$O(n)$ negative links, but uses negative sampling instead to consider only $k_n$ negative
pairs $(i, j')$ for each positive pair $(i, j)$, and is thus more suitable for large graphs.
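
A Monte Carlo version of this objective for a single positive pair can be sketched as follows. It is an illustration under assumptions: the negative embeddings `Z_neg` are taken as already sampled from $p_n$, the expectation is replaced by the sum over those $k_n$ samples, and the function names are hypothetical.

```python
import numpy as np

def log_sigmoid(x):
    # Numerically stable log(sigmoid(x)) = -log(1 + exp(-x)).
    return -np.logaddexp(0.0, -x)

def neg_sampling_loss(z_i, z_j, Z_neg):
    """Estimate of Eq. (10.21) for one positive pair (i, j).

    z_i, z_j: (dim,) embeddings of the positive pair.
    Z_neg:    (k_n, dim) embeddings of k_n sampled negatives j' ~ p_n.
    """
    pos_term = -log_sigmoid(z_i @ z_j)
    # log(1 - sigmoid(s)) equals log(sigmoid(-s)); summing over the k_n
    # samples estimates k_n times the expectation in Eq. (10.21).
    neg_term = -np.sum(log_sigmoid(-(Z_neg @ z_i)))
    return pos_term + neg_term
```

The cost per positive pair is proportional to $k_n$ rather than to the number of nodes, which is the scalability advantage noted above.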

In the context of recommender systems, there are also many node-based meth- 
ods that can be seen as variants of GAE/VGAE. Monti et al (2017) use GNNs to 
learn user and item embeddings from their respective nearest-neighbor networks, 
and use the inner product between user and item embeddings to predict links. Berg 
et al (2017) propose the graph convolutional matrix completion (GC-MC) model 
which applies a GNN to the user-item bipartite graph to learn user and item embed- 
dings. They use one-hot encoding of node indices as the input node features, and 
use the bilinear product between user and item embeddings to predict links.
SpectralCF (Zheng et al, 2018a) uses a spectral-GNN on the bipartite graph to learn node
embeddings. The PinSage model (Ying et al, 2018b) uses node content features as 
the input node features, and uses the GraphSAGE (Hamilton et al, 2017b) model to 
map related items to similar embeddings. 

In the context of knowledge graph completion, R-GCN (Relational Graph Con- 
volutional Neural Network) (Schlichtkrull et al, 2018) is one representative node- 
based method, which considers the relation types by giving different weights to 
different relation types during the message passing. SACN (Structure-Aware Con- 
volutional Network) (Shang et al, 2019) performs message passing for each relation 
type’s induced subgraphs individually and then uses a weighted sum of node em- 
beddings from different relation types. 


10.3.2 Subgraph-Based Methods 


Subgraph-based methods extract a local subgraph around each target link and learn 
a subgraph representation through a GNN for link prediction. 


10.3.2.1 The SEAL Framework 


The pioneering work of subgraph-based methods is SEAL (Zhang and Chen, 
2018b). SEAL first extracts an enclosing subgraph for each target link to predict, 
and then applies a graph-level GNN (with pooling) to classify whether the subgraph 
corresponds to link existence. The enclosing subgraph around a node set is defined
as follows.

Fig. 10.2: Illustration of the SEAL framework. SEAL first extracts enclosing
subgraphs around target links to predict. It then applies a node labeling to the enclosing
subgraphs to differentiate nodes of different roles within a subgraph. Finally, the
labeled subgraphs are fed into a GNN to learn graph structure features (supervised
heuristics) for link prediction.


Definition 10.1. (Enclosing subgraph) For a graph $\mathcal{G} = (V, E)$, given a set of
nodes $S \subseteq V$, the $h$-hop enclosing subgraph for $S$ is the subgraph $\mathcal{G}_S^h$ induced from
$\mathcal{G}$ by the set of nodes $\bigcup_{j \in S} \{ i \mid d(i, j) \le h \}$, where $d(i, j)$ is the shortest path distance
between nodes $i$ and $j$.


In other words, the $h$-hop enclosing subgraph around a node set $S$ contains nodes
within $h$ hops of any node in $S$, as well as all the edges between these nodes. In some
literature, it is also called $h$-hop local/rooted subgraph, or $h$-hop ego network. In link
prediction tasks, the node set $S$ denotes the two nodes between which to predict a
link. For example, when predicting the link between $x$ and $y$, $S = \{x, y\}$ and $\mathcal{G}_{x,y}^h$
denotes the $h$-hop enclosing subgraph for link $(x, y)$.
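
Definition 10.1 translates directly into a breadth-first search from each node of $S$. The sketch below assumes an undirected graph stored as a dictionary of neighbor lists with integer node ids; the function names are hypothetical.

```python
from collections import deque

def bfs_distances(adj, source, max_hops):
    """Shortest-path distances from source, explored up to max_hops."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        if dist[u] == max_hops:          # do not expand beyond h hops
            continue
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def enclosing_subgraph(adj, S, h):
    """Nodes and edges of the h-hop enclosing subgraph around node set S."""
    nodes = set()
    for j in S:                          # union over j in S of {i : d(i, j) <= h}
        nodes |= set(bfs_distances(adj, j, h))
    # Induced subgraph: keep every edge whose endpoints are both selected.
    edges = {(u, v) for u in nodes for v in adj[u] if v in nodes and u < v}
    return nodes, edges
```

For link prediction, `S` would be the pair `{x, y}`, so the result is the subgraph denoted $\mathcal{G}_{x,y}^h$ in the text.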

The motivation for extracting an enclosing subgraph for each link is that
SEAL aims to automatically learn graph structure features from the network.
Observing that all first-order heuristics can be computed from the 1-hop enclosing
subgraph around the target link and all second-order heuristics can be computed from
the 2-hop enclosing subgraph around the target link, SEAL aims to use a GNN to
learn general graph structure features (supervised heuristics) from the extracted
$h$-hop enclosing subgraphs instead of using predefined heuristics.

After extracting the enclosing subgraph $\mathcal{G}_{x,y}^h$, the next step is node labeling.
SEAL applies Double Radius Node Labeling (DRNL) to give an integer label to each node
in the subgraph as its additional feature. The purpose is to use different labels to
differentiate nodes of different roles in the enclosing subgraph. For instance, the center
nodes $x$ and $y$ are the target nodes between which the target link is located, so they
differ from the remaining nodes and should be distinguished. Similarly, nodes at
different hops w.r.t. $x$ and $y$ may have different structural importance to the link
existence, and thus can also be assigned different labels. As discussed in Section 10.4.2,
a proper node labeling such as DRNL is crucial for the success of subgraph-based link
prediction methods, and gives subgraph-based methods a higher link representation
learning ability than node-based methods.

DRNL works as follows: First, assign label 1 to $x$ and $y$. Then, for any node $i$ with
radius $(d(i,x), d(i,y)) = (1,1)$, assign label 2. Nodes with radius (1,2) or (2,1) get
label 3. Nodes with radius (1,3) or (3,1) get 4. Nodes with (2,2) get 5. Nodes with
(1,4) or (4,1) get 6. Nodes with (2,3) or (3,2) get 7, and so on. In other
words, DRNL iteratively assigns larger labels to nodes with a larger radius w.r.t. the
two center nodes.

DRNL satisfies the following criteria: 1) The two target nodes x and y always 
have the distinct label “1” so that they can be distinguished from the context nodes. 
2) Nodes $i$ and $j$ have the same label if and only if their “double radii” are the
same, i.e., $i$ and $j$ have the same distances to $(x, y)$. This way, nodes of the same
relative positions within the subgraph (described by the double radius $(d(i,x), d(i,y))$)
always have the same label. 

DRNL has a closed-form solution for directly mapping (d(i,x),d(i,y)) to labels: 


$$l(i) = 1 + \min(d_x, d_y) + (d/2)\left[ (d/2) + (d\%2) - 1 \right], \tag{10.22}$$


where $d_x := d(i,x)$, $d_y := d(i,y)$, $d := d_x + d_y$, and $(d/2)$ and $(d\%2)$ are the integer
quotient and remainder of $d$ divided by 2, respectively. For nodes with $d(i,x) = \infty$
or $d(i,y) = \infty$, DRNL gives them a null label 0.
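
The closed form in Equation (10.22) can be checked against the enumeration given earlier with a short function. This is a sketch under assumptions: distances are passed in as integers (or `math.inf` for unreachable nodes), the two target nodes are handled explicitly so that they always receive label 1, and the name `drnl_label` is hypothetical.

```python
import math

def drnl_label(dx, dy):
    """DRNL label of a node with distances dx = d(i, x) and dy = d(i, y)."""
    if math.isinf(dx) or math.isinf(dy):
        return 0                         # null label for unreachable nodes
    if dx == 0 or dy == 0:
        return 1                         # the node is one of the targets x, y
    d = dx + dy
    q, r = divmod(d, 2)                  # integer quotient and remainder of d / 2
    return 1 + min(dx, dy) + q * (q + r - 1)

# Double radii in the order listed in the text: labels 2, 3, 4, 5, 6, 7.
radii = [(1, 1), (1, 2), (1, 3), (2, 2), (1, 4), (2, 3)]
labels = [drnl_label(dx, dy) for dx, dy in radii]
```

Because the formula depends only on `min(dx, dy)` and `dx + dy`, it is symmetric in the two distances, so radii such as (1,2) and (2,1) share a label.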

After getting the DRNL labels, SEAL transforms them into one-hot encoding 
vectors, or feeds them to an embedding layer to get their embeddings. These new 
feature vectors are concatenated with the original node content features (if any) to 
form the new node features. SEAL additionally allows concatenating some pre- 
trained node embeddings such as node2vec embeddings to node features. How- 
ever, as its experimental results show, adding pretrained node embeddings does not 
show clear benefits to the final performance (Zhang and Chen, 2018b). Furthermore, 
adding pretrained node embeddings makes SEAL lose the inductive learning ability. 

Finally, SEAL feeds these enclosing subgraphs as well as their new node feature 
vectors into a graph-level GNN, DGCNN (Zhang et al, 2018g), to learn a graph 
classification function. The groundtruth of each subgraph is whether the two cen- 
ter nodes really have a link. To train this GNN, SEAL randomly samples N exist- 
ing links from the network as positive training links, and samples an equal number 
of unobserved links (random node pairs) as negative training links. After training, 
SEAL applies the trained GNN to new unobserved node pairs’ enclosing subgraphs 
to predict their links. The entire SEAL framework is illustrated in Figure 10.2. 
SEAL achieves strong performance for link prediction, consistently outperforming
predefined heuristics (Zhang and Chen, 2018b).


10.3.2.2 Variants of SEAL 


SEAL inspired many follow-up works. For example, Cai and Ji (2020) propose to 
use enclosing subgraphs of different scales to learn scale-invariant models. Li et al 
(2020e) propose Distance Encoding (DE), which generalizes DRNL to node classification
and general node set classification problems and theoretically analyzes the
power it brings to GNNs. The line graph link prediction (LGLP) model (Cai et al,
2020c) transforms each enclosing subgraph into its line graph and uses the center 
node embedding in the line graph to predict the original link. 

SEAL is also generalized to the bipartite graph link prediction problem of rec- 
ommender systems (Zhang and Chen, 2019). The model is called Inductive Graph- 
based Matrix Completion (IGMC). IGMC also samples an enclosing subgraph 
around each target (user, item) pair, but uses a different node labeling scheme. For 
each enclosing subgraph, it first gives label 0 and label 1 to the target user and the
target item, respectively. The remaining nodes’ labels are determined based on both 
their node types and their distances to the target user and item: if a user-type node’s 
shortest path to reach either the target user or the target item has a length k, it will get 
a label 2k; if an item-type node’s shortest path to reach the target user or the target 
item has a length k, it will get a label 2k + 1. This way, the target nodes can always 
be distinguished from the context nodes, and users can be distinguished from items 
(users always have even labels). Furthermore, nodes of different distances to the 
center nodes can be differentiated, too. Finally, the enclosing subgraphs are fed into 
a GNN with R-GCN convolution layers to incorporate the edge type information 
(each edge type corresponds to a different rating). And the output representations 
of the target user and the target item are concatenated as the link representation to 
predict the target rating. IGMC is an inductive matrix completion model without 
relying on any content features, i.e., the model predicts ratings based only on local 
graph structures, and the learned model can transfer to unseen users/items or new 
tasks without retraining. 
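
The IGMC labeling rule described above is simple enough to state as a function. The sketch below assumes that the distance `k` of a context node has already been computed during subgraph extraction; the function name and the string node types are hypothetical.

```python
def igmc_label(node_type, k=0, is_target=False):
    """Node label in an IGMC enclosing subgraph.

    node_type: "user" or "item".
    k:         distance of a context node to the target pair (k >= 1).
    is_target: True for the target user or the target item.
    """
    if is_target:
        return 0 if node_type == "user" else 1
    return 2 * k if node_type == "user" else 2 * k + 1
```

Under this rule the label parity encodes the node type (even for users, odd for items), and labels 0 and 1 are reserved for the target pair because context nodes have `k >= 1`.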

In the context of knowledge graph completion, SEAL is generalized to GraIL
(Graph Inductive Learning) (Teru et al, 2020). It also follows the enclosing subgraph
extraction, node labeling, and GNN prediction framework. For enclosing subgraph
extraction, it extracts the subgraph induced by all the nodes that occur on at least
one path of length at most $h + 1$ between the two target nodes. Unlike SEAL, the
enclosing subgraph of GraIL does not include those nodes that are only neighbors
of one target node but are not neighbors of the other target node. This is because for
knowledge graph reasoning, paths connecting two target nodes are more important
than dangling nodes. After extracting the enclosing subgraphs, GraIL applies
DRNL to label the enclosing subgraphs and uses a variant of R-GCN by enhancing 
R-GCN with edge attention to output the score for each link to predict. 


10.3.3 Comparing Node-Based Methods and Subgraph-Based 
Methods 


At first glance, both node-based methods and subgraph-based methods learn graph 
structure features around target links based on a GNN. However, as we will show, 
subgraph-based methods actually have a higher link representation ability than 
node-based methods due to modeling the associations between two target nodes.

Fig. 10.3: The different link representation ability between node-based methods and
subgraph-based methods. In the left graph, nodes $v_2$ and $v_3$ are isomorphic; links
$(v_1, v_2)$ and $(v_4, v_3)$ are isomorphic; link $(v_1, v_2)$ and link $(v_1, v_3)$ are not
isomorphic. However, a node-based method cannot differentiate $(v_1, v_2)$ and $(v_1, v_3)$. In
the middle graph, when we predict $(v_1, v_2)$, we label these two nodes differently
from the rest, so that a GNN is aware of the target link when learning $v_1$ and $v_2$’s
representations. Similarly, when predicting $(v_1, v_3)$, nodes $v_1$ and $v_3$ will be labeled
differently (shown in the right graph). This way, the representation of $v_2$ in the left
graph will be different from the representation of $v_3$ in the right graph, enabling
GNNs to distinguish $(v_1, v_2)$ and $(v_1, v_3)$.


We first use an example to show node-based methods’ limitation for detecting
associations between two target nodes. Figure 10.3 left shows a graph we want to
perform link prediction on. In this graph, nodes $v_2$ and $v_3$ are isomorphic (symmetric
to each other), and links $(v_1, v_2)$ and $(v_4, v_3)$ are also isomorphic. However, link
$(v_1, v_2)$ and link $(v_1, v_3)$ are not isomorphic, as they are not symmetric in the graph.
In fact, $v_1$ is much closer to $v_2$ than to $v_3$ in the graph, and shares more common
neighbors with $v_2$. Thus, intuitively we do not want to predict $(v_1, v_2)$ and $(v_1, v_3)$
the same. However, because $v_2$ and $v_3$ are isomorphic, a node-based method will
learn the same node representation for $v_2$ and $v_3$ (due to identical neighborhoods).
Then, because node-based methods aggregate two node representations as a link
representation, they will learn the same link representation for $(v_1, v_2)$ and $(v_1, v_3)$
and subsequently output the same link existence probability for them. This is clearly
not what we want.

The root cause of this issue is that node-based methods compute two node repre- 
sentations independently of each other, without considering the relative positions 
and associations between the two nodes. For example, although $v_2$ and $v_3$ have
different relative positions w.r.t. $v_1$, a GNN for learning $v_2$ and $v_3$’s representations is
unaware of this difference because it treats $v_2$ and $v_3$ symmetrically.

With node-based methods, GNNs cannot even learn to count the common
neighbors between two nodes (which is 1 for $(v_1, v_2)$ and 0 for $(v_1, v_3)$), one of
the most fundamental graph structure features for link prediction. This is again
because node-based methods do not consider the other target node when computing
one target node’s representation. For example, when computing the representation
of $v_1$, node-based methods do not care about which node is the other target node: no
matter whether the other node has dense connections with it (like $v_2$) or is far away
from it (like $v_3$), node-based methods will learn the same representation for $v_1$. The
failure to model the associations between two target nodes sometimes results in bad
link prediction performance.

Different from node-based methods, subgraph-based methods perform link
prediction by extracting an enclosing subgraph around each target link. As we can see,
if we extract 1-hop enclosing subgraphs for both $(v_1, v_2)$ and $(v_1, v_3)$, then they are
immediately differentiable due to their different enclosing subgraph structures: the
enclosing subgraph around $(v_1, v_2)$ is a single connected component, while the
enclosing subgraph around $(v_1, v_3)$ is composed of two connected components. Most
GNNs can easily assign these two subgraphs different representations.

In addition, the node labeling step in subgraph-based methods also helps model
the associations between the two target nodes. For example, let us assume we do not
extract enclosing subgraphs, but only apply a node labeling to the original graph.
We assume the simplest node labeling, which only distinguishes the two target nodes
from the remaining nodes by assigning label 1 to the two target nodes and label 0 to the
remaining nodes (we call it the zero-one labeling trick). Then, when we want to predict link
$(v_1, v_2)$, we give $v_1, v_2$ a different label from those of the remaining nodes, as shown by
different colors in Figure 10.3 middle. With $v_1$ and $v_2$ labeled, when a GNN is
computing $v_2$’s representation, it is also “aware” of the source node $v_1$. And when
we want to predict link $(v_1, v_3)$, we will again give $v_1, v_3$ a different label, as shown
in Figure 10.3 right. This way, $v_2$ and $v_3$’s node representations are no longer the
same in the two differently labeled graphs due to the presence of the labeled $v_1$,
and we are able to give different predictions to $(v_1, v_2)$ and $(v_1, v_3)$. This method
is called the labeling trick (Zhang et al, 2020c). We will discuss it more thoroughly in
Section 10.4.2.
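
The effect of the zero-one labeling trick can be reproduced on a toy example. The sketch below is an illustration under assumptions: it uses a hypothetical 4-node path graph a-b-c-d (not the graph of Figure 10.3), in which b and c are symmetric, and it uses 1-WL color refinement as a stand-in for a message-passing GNN.

```python
def wl_colors(adj, labels, rounds=2):
    """1-WL color refinement: a node's new color combines its own color with
    the sorted multiset of its neighbors' colors."""
    colors = dict(labels)
    for _ in range(rounds):
        colors = {
            v: (colors[v], tuple(sorted(colors[u] for u in adj[v])))
            for v in adj
        }
    return colors

# Hypothetical path graph a - b - c - d; nodes b and c are symmetric.
adj = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}

# Node-based view: no labels, so b and c receive identical colors and the
# pairs (a, b) and (a, c) cannot be told apart.
plain = wl_colors(adj, {v: 0 for v in adj})
same_without_labels = plain["b"] == plain["c"]

# Zero-one labeling: mark the two target nodes with 1, all other nodes with 0.
for_ab = wl_colors(adj, {"a": 1, "b": 1, "c": 0, "d": 0})
for_ac = wl_colors(adj, {"a": 1, "b": 0, "c": 1, "d": 0})
differ_with_labels = (for_ab["a"], for_ab["b"]) != (for_ac["a"], for_ac["c"])
```

Without labels the refinement cannot separate the adjacent pair (a, b) from the distance-2 pair (a, c); once the targets are labeled, the colors of the two pairs differ after the first round.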


10.4 Theory for Link Prediction 


In this section, we will introduce some theoretical developments on GNN-based link
prediction. For subgraph-based methods, one important motivation is to learn
supervised heuristics (graph structure features) from links’ neighborhoods. Then, an
important question to ask is, how well can GNNs learn existing successful heuristics?
The γ-decaying heuristic theory (Zhang and Chen, 2018b) answers this question.
In Section 10.3.3, we have seen the limitation of node-based methods for modeling
the associations and relationships between two target nodes, and we have also seen
that a simple zero-one node labeling can help solve this problem. Why and how can
such a simple labeling trick achieve a better link representation learning
ability? What are the general requirements for a node labeling scheme to achieve this
ability? The analysis of labeling trick answers these questions (Zhang et al, 2020c).


10.4.1 γ-Decaying Heuristic Theory


When using GNNs for link prediction, we want to learn graph structure features 
useful for predicting links based on message passing. However, it is usually not
possible to use very deep message passing layers to aggregate information from the
entire network due to the computation complexity introduced by neighbor explosion 
and the issue of oversmoothing (Li et al, 2018b). This is why node-based methods 
(such as GAE) only use 1 to 3 message passing layers in practice, and why subgraph- 
based methods only extract a small 1-hop or 2-hop local enclosing subgraph around 
each link. 

The γ-decaying heuristic theory (Zhang and Chen, 2018b) mainly answers how
much structural information useful for link prediction is preserved in the local
neighborhood of the link, in order to justify applying a GNN only to a local enclosing
subgraph in subgraph-based methods. To answer this question, the γ-decaying
heuristic theory studies how well existing link prediction heuristics can be
approximated from local enclosing subgraphs. If all these existing successful heuristics can
be accurately computed or approximated from local enclosing subgraphs, then we
are more confident to use a GNN to learn general graph structure features from these
local subgraphs.


10.4.1.1 Definition of γ-Decaying Heuristics


Firstly, a direct conclusion from the definition of h-hop enclosing subgraphs (Defi- 
nition 10.1) is: 


Proposition 10.1. Any $h$-order heuristic score for $(x, y)$ can be accurately calculated
from the $h$-hop enclosing subgraph $\mathcal{G}_{x,y}^h$ around $(x, y)$.


For example, a 1-hop enclosing subgraph contains all the information needed to 
calculate any first-order heuristics, while a 2-hop enclosing subgraph contains all the 
information needed to calculate any first and second-order heuristics. This indicates 
that first and second-order heuristics can be learned from local enclosing subgraphs 
based on an expressive GNN. However, how about high-order heuristics? High-order
heuristics usually have better link prediction performance than local ones. To
study high-order heuristics’ local approximability, the γ-decaying heuristic theory
first defines a general formulation of high-order heuristics, namely the γ-decaying
heuristic.


Definition 10.2. (γ-decaying heuristic) A γ-decaying heuristic for link $(x, y)$ has
the following form:


$$\mathcal{H}(x, y) = \eta \sum_{l=1}^{\infty} \gamma^l f(x, y, l), \tag{10.23}$$


where $\gamma$ is a decaying factor between 0 and 1, $\eta$ is a positive constant or a positive
function of $\gamma$ that is upper bounded by a constant, $f$ is a nonnegative function of
$x, y, l$ under the given network, and $l$ can be understood as the iteration number.


Next, the theory proves that under certain conditions, any γ-decaying heuristic can be
approximated from an $h$-hop enclosing subgraph, and the approximation error
decreases at least exponentially with $h$.

Theorem 10.1. Given a γ-decaying heuristic $\mathcal{H}(x, y) = \eta \sum_{l=1}^{\infty} \gamma^l f(x, y, l)$, if $f(x, y, l)$
satisfies:

• (property 1) $f(x, y, l) \le \lambda^l$ where $\lambda < \frac{1}{\gamma}$; and

• (property 2) $f(x, y, l)$ is calculable from $\mathcal{G}_{x,y}^h$ for $l = 1, 2, \cdots, g(h)$, where
$g(h) = ah + b$ with $a, b \in \mathbb{N}$ and $a > 0$,


then $\mathcal{H}(x, y)$ can be approximated from $\mathcal{G}_{x,y}^h$, and the approximation error decreases
at least exponentially with $h$.


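Proof. We can approximate such a γ-decaying heuristic by summing over its
first $g(h)$ terms.

$$\tilde{\mathcal{H}}(x, y) := \eta \sum_{l=1}^{g(h)} \gamma^l f(x, y, l). \tag{10.24}$$


The approximation error can be bounded as follows.


$$|\mathcal{H}(x, y) - \tilde{\mathcal{H}}(x, y)| = \eta \sum_{l=g(h)+1}^{\infty} \gamma^l f(x, y, l) \le \eta \sum_{l=ah+b+1}^{\infty} \gamma^l \lambda^l = \eta (\gamma\lambda)^{ah+b+1} (1 - \gamma\lambda)^{-1}.$$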


The above proof indicates that a smaller $\gamma\lambda$ leads to a faster decaying speed and a
smaller approximation error. To approximate a y-decaying heuristic, one just needs 
to sum its first few terms calculable from an h-hop enclosing subgraph. 

Then, a natural question to ask is which existing high-order heuristics belong to
γ-decaying heuristics that allow local approximations. Surprisingly, the γ-decaying
heuristic theory shows that the three most popular high-order heuristics, Katz index,
rooted PageRank and SimRank (listed in Table 10.1), are all γ-decaying heuristics
which satisfy the properties in Theorem 10.1.

To prove these, we need the following lemma first. 


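Lemma 10.1. Any walk between $x$ and $y$ with length $l \le 2h + 1$ is included in $\mathcal{G}_{x,y}^h$.


Proof. Given any walk $w = \langle x, v_1, \cdots, v_{l-1}, y \rangle$ with length $l$, we will show
that every node $v_i$ is included in $\mathcal{G}_{x,y}^h$. Consider any $v_i$. Assume $d(v_i, x) \ge h + 1$
and $d(v_i, y) \ge h + 1$. Then,
$2h + 1 \ge l = |\langle x, v_1, \cdots, v_i \rangle| + |\langle v_i, \cdots, v_{l-1}, y \rangle| \ge d(v_i, x) + d(v_i, y) \ge 2h + 2$,
a contradiction. Thus, $d(v_i, x) \le h$ or $d(v_i, y) \le h$.
By the definition of $\mathcal{G}_{x,y}^h$, $v_i$ must be included in $\mathcal{G}_{x,y}^h$.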


Next we present the analysis on Katz, rooted PageRank and SimRank.


10.4.1.2 Katz index


The Katz index (Katz, 1953) for $(x, y)$ is defined as


$$\mathrm{Katz}_{x,y} = \sum_{l=1}^{\infty} \beta^l |\mathrm{walks}^{\langle l \rangle}(x, y)| = \sum_{l=1}^{\infty} \beta^l [A^l]_{x,y}, \tag{10.25}$$


where $\mathrm{walks}^{\langle l \rangle}(x, y)$ is the set of length-$l$ walks between $x$ and $y$, and $A^l$ is the $l$-th
power of the adjacency matrix of the network. The Katz index sums over the collection
of all walks between $x$ and $y$, where a walk of length $l$ is damped by $\beta^l$ ($0 < \beta < 1$),
giving more weight to shorter walks.

The Katz index is directly defined in the form of a γ-decaying heuristic with $\eta = 1$,
$\gamma = \beta$, and $f(x, y, l) = |\mathrm{walks}^{\langle l \rangle}(x, y)|$. According to Lemma 10.1, $|\mathrm{walks}^{\langle l \rangle}(x, y)|$
is calculable from $\mathcal{G}_{x,y}^h$ for $l \le 2h + 1$, thus property 2 in Theorem 10.1 is satisfied.
Now we show when property 1 is satisfied.


Proposition 10.2. For any nodes $i, j$, $[A^l]_{i,j}$ is bounded by $d^l$, where $d$ is the maximum
node degree of the network.


Proof. We prove it by induction. When $l = 1$, $A_{i,j} \le d$ for any $(i, j)$. Thus the
base case is correct. Now, assume by induction that $[A^l]_{i,j} \le d^l$ for any $(i, j)$;
we have


$$[A^{l+1}]_{i,j} = \sum_{k=1}^{|V|} [A^l]_{i,k} A_{k,j} \le d^l \sum_{k=1}^{|V|} A_{k,j} \le d^l d = d^{l+1}.$$


Taking $\lambda = d$, we can see that whenever $d < 1/\beta$, the Katz index will satisfy
property 1 in Theorem 10.1. In practice, the damping factor $\beta$ is often set to very
small values like 5E-4 (Liben-Nowell and Kleinberg, 2007), which implies that Katz
can be very well approximated from the $h$-hop enclosing subgraph.
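
The exponential decay of the truncation error can be observed numerically. The sketch below is an illustration under assumptions: it works on the full adjacency matrix rather than on an extracted subgraph (by Lemma 10.1, walks of length at most $2h + 1$ between the targets give the same counts in both), it uses the standard closed form $(I - \beta A)^{-1} - I$ for the infinite sum, which is valid when $\beta$ is smaller than the reciprocal of the largest eigenvalue of $A$, and the function names are hypothetical.

```python
import numpy as np

def katz_exact(A, beta):
    """Sum over l >= 1 of beta^l A^l, via the geometric-series closed form."""
    n = A.shape[0]
    return np.linalg.inv(np.eye(n) - beta * A) - np.eye(n)

def katz_truncated(A, beta, max_len):
    """Katz scores using only walks of length 1, ..., max_len."""
    total = np.zeros(A.shape, dtype=float)
    power = np.eye(A.shape[0])
    for l in range(1, max_len + 1):
        power = power @ A                # power now holds A^l
        total += beta ** l * power
    return total

# Path graph on 4 nodes; the maximum degree is d = 2, so beta * d = 0.2 < 1.
A = np.array([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]], dtype=float)
beta = 0.1
exact = katz_exact(A, beta)
# Largest entrywise error when only walks of length <= 2h + 1 are used.
errors = [np.abs(exact - katz_truncated(A, beta, 2 * h + 1)).max() for h in (0, 1, 2)]
```

Each error stays below the bound $(\beta d)^{2h+2} / (1 - \beta d)$ obtained from the proof of Theorem 10.1 with $\eta = 1$, $\gamma = \beta$, $\lambda = d$ and $g(h) = 2h + 1$.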


10.4.1.3 PageRank 


The rooted PageRank for node $x$ calculates the stationary distribution of a random
walker starting at $x$, who iteratively moves to a random neighbor of its current
position with probability $\alpha$ or returns to $x$ with probability $1 - \alpha$. Let $\pi_x$ denote the
stationary distribution vector. Let $[\pi_x]_i$ denote the probability that the random walker
is at node $i$ under the stationary distribution.

Let $P$ be the transition matrix with $P_{i,j} = \frac{1}{|\Gamma(v_j)|}$ if $(i, j) \in E$ and $P_{i,j} = 0$ otherwise.
Let $e_x$ be a vector with the $x$-th element being 1 and others being 0. The stationary
distribution satisfies


$$\pi_x = \alpha P \pi_x + (1 - \alpha) e_x. \tag{10.26}$$


When used for link prediction, the score for $(x, y)$ is given by $[\pi_x]_y$ (or
$[\pi_x]_y + [\pi_y]_x$ for symmetry). To show that rooted PageRank is a γ-decaying heuristic, we
introduce the inverse P-distance theory (Jeh and Widom, 2003), which states that
$[\pi_x]_y$ can be equivalently written as follows:


$$[\pi_x]_y = (1 - \alpha) \sum_{w : x \rightsquigarrow y} P[w] \alpha^{\mathrm{len}(w)}, \tag{10.27}$$


where the summation is taken over all walks $w$ starting at $x$ and ending at $y$ (possibly
touching $x$ and $y$ multiple times). For a walk $w = \langle v_0, v_1, \cdots, v_k \rangle$,
$\mathrm{len}(w) := |\langle v_0, v_1, \cdots, v_k \rangle|$ is the length of the walk. The term $P[w]$ is defined as
$\prod_{i=0}^{k-1} \frac{1}{|\Gamma(v_i)|}$, which can be interpreted as the probability of traveling $w$. Now we
have the following theorem.


Theorem 10.2. The rooted PageRank heuristic is a γ-decaying heuristic which satisfies
the properties in Theorem 10.1.


Proof. We first write $[\pi_x]_y$ in the following form.


$$[\pi_x]_y = (1 - \alpha) \sum_{l=1}^{\infty} \sum_{\substack{w : x \rightsquigarrow y \\ \mathrm{len}(w) = l}} P[w] \alpha^l. \tag{10.28}$$


Defining $f(x, y, l) := \sum_{w : x \rightsquigarrow y,\ \mathrm{len}(w) = l} P[w]$ leads to the form of a γ-decaying heuristic.
Note that $f(x, y, l)$ is the probability that a random walker starting at $x$ stops at
$y$ with exactly $l$ steps, which satisfies $\sum_{z \in V} f(x, z, l) = 1$. Thus, $f(x, y, l) \le 1 < \frac{1}{\alpha}$
(property 1). According to Lemma 10.1, $f(x, y, l)$ is also calculable from
$\mathcal{G}_{x,y}^h$ for $l \le 2h + 1$ (property 2).
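
For reference, the stationary distribution in Equation (10.26) can be computed by fixed-point iteration. The sketch below makes several assumptions: the graph is undirected with no isolated nodes, nodes are indexed from 0, `P` is column-normalized as in the definition of the transition matrix above, and the function name, the default `alpha` and the iteration count are illustrative.

```python
import numpy as np

def rooted_pagerank(A, x, alpha=0.85, iters=200):
    """Iterate pi <- alpha * P @ pi + (1 - alpha) * e_x, as in Eq. (10.26)."""
    deg = A.sum(axis=0)                  # |Gamma(v_j)| for every column j
    P = A / deg                          # P[i, j] = 1/|Gamma(v_j)| if (i, j) is an edge
    e_x = np.zeros(A.shape[0])
    e_x[x] = 1.0
    pi = e_x.copy()
    for _ in range(iters):               # contraction with factor alpha
        pi = alpha * (P @ pi) + (1.0 - alpha) * e_x
    return pi
```

Because the update is a contraction with factor $\alpha$, the contribution of walks longer than $l$ shrinks like $\alpha^l$, which is the decay that Theorem 10.2 exploits. The link score for $(x, y)$ would then be `rooted_pagerank(A, x)[y]`.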


10.4.1.4 SimRank 


The SimRank score (Jeh and Widom, 2002) is motivated by the intuition that two 
nodes are similar if their neighbors are also similar. It is defined in the following 
recursive way: if x = y, then s(x,y) := 1; otherwise, 


$$s(x, y) := \gamma \frac{\sum_{a \in \Gamma(x)} \sum_{b \in \Gamma(y)} s(a, b)}{|\Gamma(x)| \cdot |\Gamma(y)|}, \tag{10.29}$$


where $\gamma$ is a constant between 0 and 1. According to (Jeh and Widom, 2002), SimRank
has an equivalent definition:


$$s(x, y) = \sum_{w : (x, y) \multimap (z, z)} P[w] \gamma^{\mathrm{len}(w)}, \tag{10.30}$$


where $w : (x, y) \multimap (z, z)$ denotes all simultaneous walks such that one walk starts at
$x$, the other walk starts at $y$, and they first meet at any vertex $z$. For a simultaneous
walk $w = \langle (v_0, u_0), \cdots, (v_k, u_k) \rangle$, $\mathrm{len}(w) = k$ is the length of the walk. The term $P[w]$
is similarly defined as $\prod_{i=0}^{k-1} \frac{1}{|\Gamma(v_i)| |\Gamma(u_i)|}$, describing the probability of this walk. Now
we have the following theorem.


Theorem 10.3. SimRank is a γ-decaying heuristic which satisfies the properties in
Theorem 10.1.


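Proof. We write $s(x, y)$ as follows.


$$s(x, y) = \sum_{l=1}^{\infty} \sum_{\substack{w : (x, y) \multimap (z, z) \\ \mathrm{len}(w) = l}} P[w] \gamma^l. \tag{10.31}$$


Defining $f(x, y, l) := \sum_{w : (x, y) \multimap (z, z),\ \mathrm{len}(w) = l} P[w]$ reveals that SimRank is a γ-decaying
heuristic. Note that $f(x, y, l) \le 1 < \frac{1}{\gamma}$. It is easy to see that $f(x, y, l)$ is also
calculable from $\mathcal{G}_{x,y}^h$ for $l \le h$.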
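
The recursive definition in Equation (10.29) can be evaluated by simple iteration starting from the identity matrix. The sketch below is illustrative only: it assumes an undirected graph given as a list of neighbor lists with nodes indexed from 0, assigns score 0 to pairs in which one node has no neighbors, and uses hypothetical names and default values.

```python
import numpy as np

def simrank(adj, gamma=0.8, iters=50):
    """Iterate the recursion of Eq. (10.29); s(x, x) is fixed to 1."""
    n = len(adj)
    S = np.eye(n)
    for _ in range(iters):
        S_new = np.eye(n)
        for x in range(n):
            for y in range(n):
                if x == y or not adj[x] or not adj[y]:
                    continue
                total = sum(S[a, b] for a in adj[x] for b in adj[y])
                S_new[x, y] = gamma * total / (len(adj[x]) * len(adj[y]))
        S = S_new
    return S

# Star graph: center 0 with leaves 1, 2, 3.
star = [[1, 2, 3], [0], [0], [0]]
S = simrank(star, gamma=0.8, iters=10)
```

In this star graph, two leaves meet at the center after one simultaneous step, so their score is $\gamma$, while a leaf and the center can never meet (their simultaneous walks keep swapping sides) and score 0.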


10.4.1.5 Discussion 


There exist several other high-order heuristics based on path counting or random
walk (Lü and Zhou, 2011) which can be incorporated into the γ-decaying
heuristic framework as well. Another interesting finding is that first and second-order
heuristics can be unified into this framework too. For example, common neighbors
can be seen as a γ-decaying heuristic with $\eta = \gamma = 1$, and $f(x, y, l) = |\Gamma(x) \cap \Gamma(y)|$
for $l = 1$, $f(x, y, l) = 0$ otherwise.

The above results reveal that most existing link prediction heuristics inherently
share the same γ-decaying heuristic form, and thus can be effectively approximated
from an $h$-hop enclosing subgraph with exponentially smaller approximation
error. The ubiquity of γ-decaying heuristics is not by accident: it implies that a
successful link prediction heuristic should put exponentially smaller weight on
structures far away from the target, as remote parts of the network intuitively make
little contribution to link existence. The γ-decaying heuristic theory builds the
foundation for learning supervised heuristics from local enclosing subgraphs, as it
implies that local enclosing subgraphs already contain enough information to learn
good graph structure features for link prediction, which is much desired considering
that learning from the entire network is often infeasible. This motivates the proposal
of subgraph-based methods.

To summarize, from small enclosing subgraphs extracted around links, we are 
able to accurately calculate first and second-order heuristics, and approximate a 
wide range of high-order heuristics with small errors. Therefore, given a sufficiently 
expressive GNN, learning from such enclosing subgraphs is expected to achieve 
performance at least as good as a wide range of heuristics. 


10.4.2 Labeling Trick 


In Section 10.3.3, we have briefly discussed the difference between node-based 
methods’ and subgraph-based methods’ link representation learning abilities. This 
is formalized into the analysis of labeling trick (Zhang et al, 2020c). 


10.4.2.1 Structural Representation 


We first introduce some preliminary knowledge on structural representation, which 
is a core concept in the analysis of labeling trick. 

We define a graph to be $\mathcal{G} = (V, E, \mathbf{A})$, where $V = \{1, 2, \ldots, n\}$ is the set of
$n$ vertices, $E \subseteq V \times V$ is the set of edges, and $\mathbf{A} \in \mathbb{R}^{n \times n \times k}$ is a
3-dimensional tensor (we call it the adjacency tensor) containing node and edge features.
The diagonal components $\mathbf{A}_{i,i,:}$ denote features of node $i$, and the off-diagonal
components $\mathbf{A}_{i,j,:}$ denote features of edge $(i, j)$. We further use $A \in \{0, 1\}^{n \times n}$
to denote the adjacency matrix of $\mathcal{G}$ with $A_{i,j} = 1$ iff $(i, j) \in E$. If there are no
node/edge features, we let $\mathbf{A} = A$. Otherwise, $A$ can be regarded as the first slice of
$\mathbf{A}$, i.e., $A = \mathbf{A}_{:,:,1}$.

A permutation $\pi$ is a bijective mapping from $\{1, 2, \ldots, n\}$ to $\{1, 2, \ldots, n\}$.
Depending on the context, $\pi(i)$ can mean assigning a new index to node $i \in V$, or
mapping node $i$ to node $\pi(i)$ of another graph. All $n!$ possible $\pi$’s constitute the
permutation group $\Pi_n$. For joint prediction tasks over a set of nodes, we use $S$ to
denote the target node set. For example, $S = \{i, j\}$ if we want to predict the link
between $i, j$. We define $\pi(S) = \{\pi(i) | i \in S\}$. We further define the permutation of
$\mathbf{A}$ as $\pi(\mathbf{A})$, where $\pi(\mathbf{A})_{\pi(i),\pi(j),:} = \mathbf{A}_{i,j,:}$.
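
The index convention above can be sketched in a few lines. The snippet assumes a graph
without features (so the adjacency tensor reduces to the adjacency matrix), uses
0-based node indices rather than the 1-based indices of the text, and represents a
permutation as a list `pi` with `pi[i]` the new index of node `i`; these choices are
ours, made only for illustration.

```python
def permute_adjacency(A, pi):
    """Return pi(A), defined by pi(A)[pi[i]][pi[j]] = A[i][j]."""
    n = len(A)
    out = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            out[pi[i]][pi[j]] = A[i][j]
    return out

def permute_set(S, pi):
    """Return pi(S) = {pi(i) | i in S}."""
    return {pi[i] for i in S}

# Path graph 0 - 1 - 2.
A = [[0, 1, 0],
     [1, 0, 1],
     [0, 1, 0]]

swap_ends = [2, 1, 0]    # exchanges the two end nodes
swap_first = [1, 0, 2]   # exchanges an end node with the middle node

print(permute_adjacency(A, swap_ends) == A)   # the path is symmetric under this swap
print(permute_adjacency(A, swap_first))
print(permute_set({0, 1}, swap_ends))
```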

Next, we define set isomorphism, which generalizes graph isomorphism to arbi- 
trary node sets. 


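Definition 10.3. (Set isomorphism) Given two n-node graphs $\mathcal{G} = (V, E, \mathbf{A})$,
$\mathcal{G}' = (V', E', \mathbf{A}')$, and two node sets $S \subseteq V$, $S' \subseteq V'$, we say $(S, \mathbf{A})$ and $(S', \mathbf{A}')$
are isomorphic (denoted by $(S, \mathbf{A}) \simeq (S', \mathbf{A}')$) if $\exists \pi \in \Pi_n$ such that $S = \pi(S')$ and
$\mathbf{A} = \pi(\mathbf{A}')$.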


When $(V, \mathbf{A}) \simeq (V', \mathbf{A}')$, we say two graphs $\mathcal{G}$ and $\mathcal{G}'$ are isomorphic (abbreviated
as $\mathbf{A} \simeq \mathbf{A}'$ because $V = \pi(V')$ for any $\pi$). Note that set isomorphism is more strict
than graph isomorphism, because it not only requires graph isomorphism, but also
requires that the permutation maps a specific node set $S$ to another node set $S'$.
In practice, when $S \neq V$, we are often more concerned with the case of $\mathbf{A} = \mathbf{A}'$,
where we are to find isomorphic node sets in the same graph (automorphism).
For example, when $S = \{i\}$, $S' = \{j\}$ and $(i, \mathbf{A}) \simeq (j, \mathbf{A})$, we say nodes $i$ and $j$ are
isomorphic in graph $\mathbf{A}$ (or they have symmetric positions/same structural role within
the graph). An example is $v_2$ and $v_3$ in Figure 10.3 left.
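
Definition 10.3 can be checked by brute force on very small graphs. The sketch below
enumerates all permutations, so it is only meant to make the definition concrete; it
assumes feature-free graphs given as 0-indexed adjacency matrices, and the 4-node path
used here is our own example rather than the graph of Figure 10.3.

```python
from itertools import permutations

def permute_adjacency(A, pi):
    """Return pi(A), defined by pi(A)[pi[i]][pi[j]] = A[i][j]."""
    n = len(A)
    out = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            out[pi[i]][pi[j]] = A[i][j]
    return out

def is_set_isomorphic(S, A, S_prime, A_prime):
    """True iff some pi satisfies S = pi(S') and A = pi(A')."""
    n = len(A)
    if len(A_prime) != n:
        return False
    for pi in permutations(range(n)):
        if {pi[i] for i in S_prime} == set(S) and permute_adjacency(A_prime, pi) == A:
            return True
    return False

# Path graph 0 - 1 - 2 - 3.
P4 = [[0, 1, 0, 0],
      [1, 0, 1, 0],
      [0, 1, 0, 1],
      [0, 0, 1, 0]]

print(is_set_isomorphic({0}, P4, {3}, P4))        # the two end nodes
print(is_set_isomorphic({1}, P4, {2}, P4))        # the two inner nodes
print(is_set_isomorphic({0, 1}, P4, {3, 2}, P4))  # mirrored links
print(is_set_isomorphic({0, 1}, P4, {0, 2}, P4))  # an edge versus a non-edge
```

Nodes 1 and 2 are isomorphic, yet the node sets {0, 1} and {0, 2} are not, which is
the situation that node-based link representations cannot resolve.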

We say a function $f$ defined over the space of $(S, \mathbf{A})$ is permutation invariant
(or invariant for short) if $\forall \pi \in \Pi_n$, $f(S, \mathbf{A}) = f(\pi(S), \pi(\mathbf{A}))$. Similarly, $f$ is
permutation equivariant if $\forall \pi \in \Pi_n$, $\pi(f(S, \mathbf{A})) = f(\pi(S), \pi(\mathbf{A}))$.

Now we define structural representation of a node set, following (Srinivasan and 
Ribeiro, 2020b; Li et al, 2020e). It assigns a unique representation to each equiva- 
lence class of isomorphic node sets. 


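Definition 10.4. (Most expressive structural representation) Given an invariant
function $\Gamma(\cdot)$, $\Gamma(S, \mathbf{A})$ is a most expressive structural representation for $(S, \mathbf{A})$ if
$\forall S, \mathbf{A}, S', \mathbf{A}'$, $\Gamma(S, \mathbf{A}) = \Gamma(S', \mathbf{A}') \Leftrightarrow (S, \mathbf{A}) \simeq (S', \mathbf{A}')$.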


For simplicity, we will briefly use structural representation to denote most expressive
structural representation in the rest of this section. We will omit $\mathbf{A}$ if it is
clear from context. We call $\Gamma(i, \mathbf{A})$ a structural node representation for $i$, and call
$\Gamma(\{i, j\}, \mathbf{A})$ a structural link representation for $(i, j)$.

Definition 10.4 requires the structural representations of two node sets to be the 
same if and only if they are isomorphic. That is, isomorphic node sets always have 
the same structural representation, while non-isomorphic node sets always have 
different structural representations. This is in contrast to positional node embed- 
dings such as DeepWalk (Perozzi et al, 2014) and matrix factorization (Mnih and 
Salakhutdinov, 2008), where two isomorphic nodes can have different node embed- 
dings (Ribeiro et al, 2017). 

So why do we need structural representations? Formally speaking, Srinivasan 
and Ribeiro (2020b) prove that any joint prediction task over node sets only requires 
most-expressive structural representations of node sets, which are the same for two 
node sets if and only if these two node sets are isomorphic. This means, for link pre- 
diction tasks, we need to learn the same representation for isomorphic links while 
discriminating non-isomorphic links by giving them different representations. Intu- 
itively speaking, two links being isomorphic means they should be indistinguishable 
from any perspective—if one link exists, the other should exist too, and vice versa. 
Therefore, link prediction ultimately requires finding such a structural link repre- 
sentation for node pairs which can uniquely identify link isomorphism classes. 

According to Figure 10.3 left, node-based methods that directly aggregate two
node representations cannot learn such a valid structural link representation because
they cannot differentiate non-isomorphic links such as $(v_1, v_2)$ and $(v_1, v_3)$. One may
wonder whether using one-hot encoding of node indices as the input node features
helps node-based methods learn such a structural link representation. Indeed, using
node-discriminating features enables node-based methods to learn different
representations for $(v_1, v_2)$ and $(v_1, v_3)$ in Figure 10.3 left. However, it also loses GNN’s
ability to map isomorphic nodes (such as $v_2$ and $v_3$) and isomorphic links (such
as $(v_1, v_2)$ and $(v_4, v_3)$) to the same representations, since any two nodes already
have different representations from the beginning. This might result in poor
generalization ability: two nodes/links may have different final representations even if
they share identical neighborhoods.

To ease our analysis, we also define a node-most-expressive GNN, which gives 
different representations to all non-isomorphic nodes and gives the same represen- 
tation to all isomorphic nodes. In other words, a node-most-expressive GNN learns 
structural node representations. 


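Definition 10.5. (Node-most-expressive GNN) A GNN is node-most-expressive if
it satisfies: $\forall i, \mathbf{A}, j, \mathbf{A}'$, $\text{GNN}(i, \mathbf{A}) = \text{GNN}(j, \mathbf{A}') \Leftrightarrow (i, \mathbf{A}) \simeq (j, \mathbf{A}')$.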


Although a polynomial-time implementation of a node-most-expressive GNN is not 
known, practical GNNs based on message passing can still discriminate almost all 
non-isomorphic nodes (Babai and Kucera, 1979), thus well approximating its power. 


10.4.2.2 Labeling Trick Enables Learning Structural Representations 


Now, we are ready to introduce the labeling trick and see how it enables learning 
structural representations of node sets. As we have seen in Section 10.4.2, a simple 
zero-one labeling trick can help a GNN distinguish non-isomorphic links such as 
(v1,v2) and (v1,v3) in Figure 10.3 left. At the same time, isomorphic links, such 
as (v1,v2) and (v4,v3), will still have the same representation, since the zero-one 
labeled graph for (v1, v2) is still symmetric to the zero-one labeled graph for (v4, v3). 
This brings an exclusive advantage over using one-hot encoding of node indices. 
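
The effect can be reproduced on a toy example. In the sketch below, a few rounds of
Weisfeiler-Lehman style color refinement stand in for a message-passing GNN (an
assumption made purely for illustration; it is not the model used in the cited work),
and a link representation is the sorted pair of the two end nodes' final colors. The
4-node path graph is our own example rather than the graph of Figure 10.3.

```python
def refine_colors(adj, init_colors, rounds=3):
    """WL-style refinement: a node's new color is (old color, sorted neighbor colors)."""
    colors = dict(init_colors)
    for _ in range(rounds):
        colors = {
            v: (colors[v], tuple(sorted(colors[u] for u in adj[v])))
            for v in adj
        }
    return colors

def link_repr_plain(adj, link):
    """Aggregate node colors computed once on the unlabeled graph."""
    colors = refine_colors(adj, {v: 0 for v in adj})
    return tuple(sorted(colors[v] for v in link))

def link_repr_zero_one(adj, link):
    """Aggregate node colors computed on the graph labeled for this link."""
    colors = refine_colors(adj, {v: int(v in link) for v in adj})
    return tuple(sorted(colors[v] for v in link))

# Path graph 0 - 1 - 2 - 3: nodes 1 and 2 are isomorphic, as are 0 and 3.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}

# Without labels, the edge (0, 1) and the non-edge (0, 2) collide.
print(link_repr_plain(adj, (0, 1)) == link_repr_plain(adj, (0, 2)))
# With zero-one labels they are separated ...
print(link_repr_zero_one(adj, (0, 1)) == link_repr_zero_one(adj, (0, 2)))
# ... while the mirrored links (0, 1) and (3, 2) still coincide.
print(link_repr_zero_one(adj, (0, 1)) == link_repr_zero_one(adj, (3, 2)))
```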

Below we give the formal definition of labeling trick, which incorporates the 
zero-one labeling trick as one specific form. 


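Definition 10.6. (Labeling trick) Given $(S, \mathbf{A})$, we stack a labeling tensor
$\mathbf{L}^{(S)} \in \mathbb{R}^{n \times n \times d}$ in the third dimension of $\mathbf{A}$ to get a new
$\mathbf{A}^{(S)} \in \mathbb{R}^{n \times n \times (k+d)}$, where $\mathbf{L}$ satisfies: $\forall S, \mathbf{A}, S', \mathbf{A}', \pi \in \Pi_n$,
(1) $\mathbf{L}^{(S)} = \pi(\mathbf{L}^{(S')}) \Rightarrow S = \pi(S')$, and
(2) $S = \pi(S'), \mathbf{A} = \pi(\mathbf{A}') \Rightarrow \mathbf{L}^{(S)} = \pi(\mathbf{L}^{(S')})$.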


To explain a bit, the labeling trick assigns a label vector to each node/edge in graph
$\mathbf{A}$, which constitutes the labeling tensor $\mathbf{L}^{(S)}$. By concatenating $\mathbf{A}$ and $\mathbf{L}^{(S)}$, we get
the adjacency tensor $\mathbf{A}^{(S)}$ of the new labeled graph. By definition we can assign
labels to both nodes and edges. For simplicity, here we only consider node labels,
i.e., we let the off-diagonal components $\mathbf{L}^{(S)}_{i,j,:}$ be all zero.

The labeling tensor $\mathbf{L}^{(S)}$ should satisfy two conditions in Definition 10.6. The
first condition requires the target nodes $S$ to have distinct labels from those of the
remaining nodes, so that $S$ is distinguishable from others. This is because if a permutation
$\pi$ preserving node labels exists between nodes of $\mathbf{A}$ and $\mathbf{A}'$, then $S$ and $S'$ must have
distinct labels to guarantee $S'$ is mapped to $S$ by $\pi$. The second condition requires
the labeling function to be permutation equivariant, i.e., when $(S, \mathbf{A})$ and $(S', \mathbf{A}')$ are
isomorphic under $\pi$, the corresponding nodes $i \in S$, $j \in S'$, $i = \pi(j)$ must always have
the same label. In other words, the labeling should be consistent across different $S$.




For example, the zero-one labeling is a valid labeling trick by always giving label 1 
to nodes in S and 0 otherwise, which is both consistent and S-discriminating. How- 
ever, an all-one labeling is not a valid labeling trick, because it cannot distinguish 
the target set S. 
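
For node-only labels that do not depend on the graph structure, the two conditions of
Definition 10.6 can be verified exhaustively on a small node set. The sketch below does
so for the zero-one and all-one labelings; the function names and the choice of n = 4
with target sets of size two are illustrative assumptions. Since these labelings ignore
the adjacency tensor, condition (2) is checked for every permutation, which is stronger
than the definition requires.

```python
from itertools import combinations, permutations

def zero_one_labels(S, n):
    """Label 1 for target nodes in S, 0 for all other nodes."""
    return [1 if i in S else 0 for i in range(n)]

def all_one_labels(S, n):
    """Label 1 for every node, regardless of S."""
    return [1] * n

def permute_labels(labels, pi):
    """Return pi(L): the label of node i moves to node pi[i]."""
    out = [None] * len(labels)
    for i, label in enumerate(labels):
        out[pi[i]] = label
    return out

def check_conditions(labeling, n, size=2):
    """Brute-force conditions (1) and (2) of Definition 10.6 for node-only labels."""
    cond1, cond2 = True, True
    subsets = [set(c) for c in combinations(range(n), size)]
    for S_prime in subsets:
        for pi in permutations(range(n)):
            mapped = permute_labels(labeling(S_prime, n), pi)
            image = {pi[i] for i in S_prime}
            # (2) S = pi(S') must imply L(S) = pi(L(S')).
            if labeling(image, n) != mapped:
                cond2 = False
            # (1) L(S) = pi(L(S')) must imply S = pi(S').
            for S in subsets:
                if labeling(S, n) == mapped and S != image:
                    cond1 = False
    return cond1, cond2

print(check_conditions(zero_one_labels, 4))  # both conditions hold
print(check_conditions(all_one_labels, 4))   # condition (1) fails
```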

Now we introduce the main theorem of labeling trick showing that with a valid 
labeling trick, a node-most-expressive GNN can learn structural link representations 
by aggregating its node representations learned from the labeled graph. 


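Theorem 10.4. Given a node-most-expressive GNN and an injective set aggregation
function AGG, for any $S, \mathbf{A}, S', \mathbf{A}'$,
$\text{GNN}(S, \mathbf{A}^{(S)}) = \text{GNN}(S', \mathbf{A}'^{(S')}) \Leftrightarrow (S, \mathbf{A}) \simeq (S', \mathbf{A}')$,
where $\text{GNN}(S, \mathbf{A}^{(S)}) := \text{AGG}(\{\text{GNN}(i, \mathbf{A}^{(S)}) | i \in S\})$.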


The proof of the above theorem can be found in Appendix A of (Zhang et al, 2020c).
Theorem 10.4 implies that $\text{AGG}(\{\text{GNN}(i, \mathbf{A}^{(S)}) | i \in S\})$ is a structural
representation for $(S, \mathbf{A})$. Remember that directly aggregating structural node
representations learned from the original graph $\mathbf{A}$ does not lead to structural link
representations. Theorem 10.4 shows that aggregating over the structural node representations
learned from the adjacency tensor $\mathbf{A}^{(S)}$ of the labeled graph, somewhat surprisingly,
results in a structural representation for $S$.

The significance of Theorem 10.4 is that it closes the gap between GNN’s node 
representation nature and link prediction’s link representation requirement, which 
solves the open question raised in (Srinivasan and Ribeiro, 2020b) questioning 
node-based GNN methods’ ability of performing link prediction. Although directly 
aggregating pairwise node representations learned by GNNs does not lead to struc- 
tural link representations, combining GNNs with a labeling trick enables learning 
structural link representations. 

It can be easily proved that the zero-one labeling, DRNL and Distance Encoding
(DE) (Li et al, 2020e) are all valid labeling tricks. This explains the superior
empirical performance of subgraph-based methods over node-based methods (Zhang
and Chen, 2018b; Zhang et al, 2020c).


10.5 Future Directions 


In this section, we introduce several important future directions for link prediction: 
accelerating subgraph-based methods, designing more powerful labeling tricks, and 
understanding when to use one-hot features. 


10.5.1 Accelerating Subgraph-Based Methods 


One important future direction is to accelerate subgraph-based methods. Although
subgraph-based methods outperform node-based methods both empirically and
theoretically, they also suffer from a high computational complexity,
which prevents them from being deployed in modern recommendation systems. How
to accelerate subgraph-based methods is thus an important problem to study.

The extra computation complexity of subgraph-based methods comes from their 
node labeling step. The reason is that for every link (i,j) to predict, we need to 
relabel the graph according to (i, j). The same node v will be labeled differently 
depending on which one is the target link, and will be given a different node rep- 
resentation by the GNN when it appears in different links’ labeled graphs. This is 
different from node-based methods, where we do not relabel the graph and each 
node only has a single representation. 

In other words, for node-based methods, we only need to apply the GNN to
the whole graph once to compute a representation for each node, while
subgraph-based methods need to repeatedly apply the GNN to differently labeled subgraphs
each corresponding to a different link. Thus, when computing link representations,
subgraph-based methods require re-applying the GNN for each target link. For a
graph with $n$ nodes and $m$ links to predict, node-based methods only need to apply
a GNN $O(n)$ times to get a representation for each node (and then use some simple
aggregation function to get link representations), while subgraph-based methods
need to apply a GNN $O(m)$ times for all links. When $m \gg n$, subgraph-based
methods have much worse time complexity than node-based methods, which is the price
for learning more expressive link representations.

Is it possible to accelerate subgraph-based methods? One possible way is to sim- 
plify the enclosing subgraph extraction process and simplify the GNN architecture. 
For example, we may adopt sampling or random walk when extracting the enclosing 
subgraphs which might largely reduce the subgraph sizes and avoid hub nodes. It is 
interesting to study such simplifications’ influence on performance. Another possi- 
ble way is to use distributed and parallel computing techniques. The enclosing sub- 
graph extraction process and the GNN computation on a subgraph are completely 
independent of each other and are naturally parallelizable. Finally, using multi-stage 
ranking techniques could also help. Multi-stage ranking will first use some simple 
methods (such as traditional heuristics) to filter out most unlikely links, and use 
more powerful methods (such as SEAL) in the later stage to only rank the most 
promising links and output the final recommendations/predictions. 
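
The multi-stage idea can be sketched as follows. The first stage ranks candidates by
the common-neighbor count and keeps a shortlist; the second stage calls a more
expensive scorer only on the shortlist. The function `multi_stage_rank`, its `keep`
parameter, and the Jaccard stand-in used for the second stage are hypothetical; in
practice the second-stage scorer would be a trained subgraph-based model.

```python
def common_neighbors(adj, x, y):
    """Cheap first-stage heuristic."""
    return len(adj[x] & adj[y])

def multi_stage_rank(adj, candidates, expensive_scorer, keep=2):
    """Filter with a cheap heuristic, then rank the survivors with a costly scorer."""
    shortlist = sorted(
        candidates, key=lambda e: common_neighbors(adj, *e), reverse=True
    )[:keep]
    return sorted(shortlist, key=lambda e: expensive_scorer(adj, e), reverse=True)

# Hypothetical toy graph and candidate links.
adj = {0: {2, 3}, 1: {2, 3}, 2: {0, 1, 4}, 3: {0, 1}, 4: {2}}
candidates = [(0, 1), (0, 4), (3, 4), (1, 4)]

calls = []
def stand_in_scorer(adj, link):
    """Placeholder for an expensive model; here simply the Jaccard index."""
    calls.append(link)
    x, y = link
    return len(adj[x] & adj[y]) / len(adj[x] | adj[y])

ranked = multi_stage_rank(adj, candidates, stand_in_scorer, keep=2)
print(ranked, len(calls))
```

Only `keep` candidates reach the second stage, so the number of expensive calls is
bounded independently of the number of candidate links.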

Either way, solving the scalability issue of subgraph-based methods can be a 
great contribution to the field. That means we can enjoy the superior link prediction 
performance of subgraph-based GNN methods without using much more computa- 
tion resources, which is expected to extend GNNs to more application domains. 


10.5.2 Designing More Powerful Labeling Tricks 


Another direction is to design more powerful labeling tricks. Definition 10.6 gives
a general definition of labeling trick. Although any labeling trick satisfying
Definition 10.6 can enable a node-most-expressive GNN to learn structural link
representations, the real-world performance of different labeling tricks can vary a lot due
to the limited expressive power and depths of practical GNNs. Moreover, subtle
differences in implementing a labeling trick can result in large performance
differences. For example, given the two target nodes x and y, when computing the
distance d(i,x) from a node i to x, DRNL will temporarily mask node y and all its 
edges, and when computing the distance d(i,y), DRNL will temporarily mask node 
x and all its edges (Zhang and Chen, 2018b). The reason for this “masking trick” is 
that DRNL aims to use the pure distance between i and x without the influence of 
y. If we do not mask y, d(i,x) will be upper bounded by d(i,y) + d(x,y), which ob- 
scures the “true distance” between i and x and might hurt the node labels’ ability to 
discriminate structurally-different nodes. As shown in Appendix H of (Zhang et al, 
2020c), this masking trick can greatly improve the performance. It is thus interest- 
ing to study how to design a more powerful labeling trick (not necessarily based on 
shortest path distance like DRNL and DE). It should not only distinguish the target 
nodes, but also assign diverse but generalizable labels to nodes with different roles 
in the subgraph. A further theoretical analysis on the power of different labeling 
tricks is also needed. 
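
The masking step described above amounts to a breadth-first search that ignores one
node. The sketch below computes d(i, x) with and without masking y on a hypothetical
toy graph; it covers only the distance computation, not the subsequent hashing of the
two distances into a DRNL label.

```python
from collections import deque

def bfs_distance(adj, source, target, masked=None):
    """Shortest-path distance from source to target, never visiting `masked`."""
    if source == target:
        return 0
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v == masked or v in dist:
                continue
            dist[v] = dist[u] + 1
            if v == target:
                return dist[v]
            queue.append(v)
    return float("inf")  # target unreachable once the masked node is removed

# Hypothetical graph: x = 0, y = 1; node 2 reaches x either through y or
# through the longer path 2 - 4 - 3 - 0; node 5 hangs off y only.
adj = {0: [1, 3], 1: [0, 2, 5], 2: [1, 4], 3: [0, 4], 4: [3, 2], 5: [1]}
x, y = 0, 1

print(bfs_distance(adj, 2, x), bfs_distance(adj, 2, x, masked=y))
print(bfs_distance(adj, 5, x), bfs_distance(adj, 5, x, masked=y))
```

Without masking, both nodes 2 and 5 are at distance 2 from x; with y masked they are
separated (3 versus unreachable), which illustrates how masking can yield more
discriminative labels.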


10.5.3 Understanding When to Use One-Hot Features 


Finally, one last important question remaining to be answered is when we should 
use the original node features and when we should use one-hot encoding features of 
node indices. Although using one-hot features makes it infeasible to learn structural 
link representations as discussed in Section 10.4.2, node-based methods using one- 
hot features show strong performance on dense networks (Zhang et al, 2020c), out- 
performing subgraph-based methods without using one-hot features by large mar- 
gins. On the other hand, Kipf and Welling (2017b) show that GAE/VGAE with 
one-hot features gives worse performance than using original features. Thus, it is 
interesting to study when to use one-hot features and when to use original features 
and theoretically understand their representation power differences on networks of 
different properties. Srinivasan and Ribeiro (2020b) provide a good analysis
connecting positional node embeddings (such as DeepWalk) with structural node
representations, showing that positional node embeddings can be seen as a sample while
the structural node representation can be seen as a distribution. This can serve as
a starting point to study the power of GNNs using one-hot encoding features, as
GNNs using one-hot encoding features can be seen as combining positional node
embeddings with message passing.
Editor’s Notes: Link prediction is the problem of predicting the existence
of a link between two nodes in a network. Hence the techniques are relevant
to graph structure learning (chapter 19), which aims to discover useful
graph structure, i.e., links, from data. Scalability (chapter 6) and
expressive power theory (chapter 8) play an important role in applying
and designing link prediction methods. Link prediction also motivates
several downstream tasks in various domains, such as predicting protein-protein
and protein-drug interactions (chapter 25), drug development (chapter 24),
and recommender systems (chapter 19). Besides, predicting links in
complex networks, including dynamic graphs (chapter 19), knowledge
graphs (chapter 24) and heterogeneous graphs (chapter 26), is also an
extension of link prediction tasks.
Chapter 11
Graph Neural Networks: Graph Generation 


Renjie Liao 


Abstract In this chapter, we first review a few classic probabilistic models for graph
generation including the Erdős–Rényi model and the stochastic block model. Then
we introduce several representative modern graph generative models that leverage
deep learning techniques like graph neural networks, variational auto-encoders,
deep auto-regressive models, and generative adversarial networks. Finally, we
conclude the chapter with a discussion on potential future directions.


11.1 Introduction 


The study of graph generation revolves around building probabilistic models over
graphs, which are also called networks in many scientific disciplines. This problem
has its roots in a branch of mathematics called random graph theory (Bollobás,
2013), which largely lies at the intersection of probability theory and graph
theory. It is also at the core of a new academic field called network science
(Barabási, 2013). Historically, researchers in these fields have often been interested in
building random graph models (i.e., constructing distributions of graphs using certain
parametric families of distributions) and proving the mathematical properties
of such models. Although this has been an extremely fruitful and successful research
direction, these classic models are too simplistic to capture the complex phenomena
(e.g., high clustering, strong connectivity, scale-free degree distributions) that
appear in real-world graphs.

With the advent of powerful deep learning techniques like graph neural networks,
we can build more expressive probabilistic models of graphs, i.e., the so-called
deep graph generative models. Such deep models can better capture the complex
dependencies within the graph data to generate more realistic graphs and further
build accurate predictive models. However, the downside is that these models
are often so complicated that we can rarely analyze their properties in a precise
manner. Recent applications of these models have demonstrated impressive
performance in modeling real-world graphs/networks, e.g., social networks, citation
networks, and molecule graphs.

In the following, we first introduce the classic graph generative models in Section
11.2 and then the modern ones that leverage deep learning techniques in Section
11.3. Finally, we conclude the chapter and discuss some promising future directions.


11.2 Classic Graph Generative Models 


In this section, we review two popular variants of the classic graph generative
models: the Erdős–Rényi model (Erdős and Rényi, 1960) and the stochastic block model
(Holland et al, 1983). They are often used as handy baselines in many applications
since their properties are already well understood. There are
many other graph generative models like the Watts–Strogatz small-world model
(Watts and Strogatz, 1998) and the Barabási–Albert (BA) preferential attachment
model (Barabási and Albert, 1999). Barabási (2013) provides a thorough survey
on these models and other aspects of network science. In the context of machine
learning, there are also quite a few non-deep-learning graph generative models like
Kronecker graphs (Leskovec et al, 2010). We do not cover these models due to the
space limit.


11.2.1 Erdős–Rényi Model


We first explain one of the most well-known random graph models, i.e., the Erdős–Rényi
model (Erdős and Rényi, 1960), named after two Hungarian mathematicians Paul
Erdős and Alfréd Rényi. Note that this model was independently proposed at
around the same time by Edgar Gilbert in (Gilbert, 1959). In the following, we first
describe the model along with its properties and then discuss its limitations.


11.2.1.1 Model 


The Erdős–Rényi model has two closely related variants, namely, $G(n, p)$ and $G(n, m)$.

G(n,p) Model In the $G(n, p)$ model, we are given $n$ labeled nodes and generate
a graph by connecting each pair of nodes with an edge with probability $p$,
independently from every other edge. In other words, all $\binom{n}{2}$ possible
edges have the equal probability $p$ to be included. Therefore, the probability of
generating a graph with $m$ edges under this model is as below,

$$p(\text{a graph with } n \text{ nodes and } m \text{ edges}) = p^{m} (1 - p)^{\binom{n}{2} - m}. \qquad (11.1)$$
The parameter $p$ controls the “density” of the graph, i.e., a larger value of $p$ makes
the graph more likely to contain more edges. When $p = \frac{1}{2}$, the above
probability becomes $\left(\frac{1}{2}\right)^{\binom{n}{2}}$, i.e., all possible $2^{\binom{n}{2}}$ graphs are chosen with
equal probability.
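
A minimal sketch of the $G(n, p)$ model follows: one function samples an edge list by
flipping an independent coin for each node pair, and another evaluates Eq. (11.1) for
a specific labeled graph with $m$ edges. The function names and the 0-based node
indices are our own choices.

```python
import random
from itertools import combinations

def sample_gnp(n, p, rng):
    """Sample a G(n, p) graph as a list of edges (u, v) with u < v."""
    return [(u, v) for u, v in combinations(range(n), 2) if rng.random() < p]

def gnp_graph_probability(n, m, p):
    """Probability of one specific labeled graph with m edges, Eq. (11.1)."""
    num_pairs = n * (n - 1) // 2
    return p ** m * (1 - p) ** (num_pairs - m)

rng = random.Random(0)
print(sample_gnp(5, 0.5, rng))
# With p = 1/2 every labeled graph on 4 nodes has probability (1/2)^6.
print(gnp_graph_probability(4, 3, 0.5), 0.5 ** 6)
```
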
Due to the independence of the edges in G(n,p), we can easily derive a few 
properties from this model. 


• The expected number of edges is $\binom{n}{2} p$.
• The degree distribution of any node $v$ is binomial:

$$p(\text{degree}(v) = k) = \binom{n-1}{k} p^{k} (1 - p)^{n-1-k} \qquad (11.2)$$

• If $np$ is a constant and $n \to \infty$, the degree distribution of any node $v$ is Poisson:

$$p(\text{degree}(v) = k) = \frac{(np)^{k} e^{-np}}{k!} \qquad (11.3)$$
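
The closeness of Eqs. (11.2) and (11.3) for large $n$ and constant $np$ can be checked
numerically. The sketch below evaluates both probability mass functions; the values
$n = 10000$ and $np = 3$ are arbitrary choices for illustration.

```python
from math import comb, exp, factorial

def binomial_degree_pmf(n, p, k):
    """Eq. (11.2): probability that a node has degree k in G(n, p)."""
    return comb(n - 1, k) * p ** k * (1 - p) ** (n - 1 - k)

def poisson_degree_pmf(n, p, k):
    """Eq. (11.3): Poisson limit with rate np."""
    rate = n * p
    return rate ** k * exp(-rate) / factorial(k)

n = 10000
p = 3 / n
for k in range(6):
    print(k, round(binomial_degree_pmf(n, p, k), 5), round(poisson_degree_pmf(n, p, k), 5))
```
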
There are an enormous number of more involved properties of this model that have
been proved (e.g., by Erdős and Rényi in the original paper). We list a few others
below.


• If $p > \frac{(1+\epsilon) \ln n}{n}$, then a graph will almost surely be connected.
• If $p < \frac{(1-\epsilon) \ln n}{n}$, then a graph will almost surely contain isolated vertices, and
thus be disconnected.
• If $np < 1$, then a graph will almost surely have no connected components of
size larger than $O(\log(n))$.


Here almost surely means that the event happens with probability 1
(i.e., the set of possible exceptions has zero measure).

G(n,m) Model In the $G(n, m)$ model, we are given $n$ labeled nodes and generate
a graph by choosing uniformly at random from the set of all graphs with $n$
nodes and $m$ edges, i.e., the probability of choosing each graph is $\binom{\binom{n}{2}}{m}^{-1}$. There are
also many important properties associated with the $G(n, m)$ model. In particular, it
is interchangeable with the $G(n, p)$ model provided that $m$ is close to $\binom{n}{2} p$ in most
investigations. Chapter 2 of (Bollobás and Béla, 2001) provides a comprehensive
discussion on the relationship between these two models. The $G(n, p)$ model is more
commonly used in practice than the $G(n, m)$ model, partly due to the ease of analysis
brought by the independence of the edges.
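
For comparison with $G(n, p)$, a $G(n, m)$ sample can be drawn by picking $m$ distinct
node pairs uniformly without replacement. The sketch below does this with the standard
library; the function names are ours.

```python
import random
from itertools import combinations
from math import comb

def sample_gnm(n, m, rng):
    """Sample a G(n, m) graph: m distinct edges chosen uniformly at random."""
    pairs = list(combinations(range(n), 2))
    return sorted(rng.sample(pairs, m))

def gnm_graph_probability(n, m):
    """Probability of each labeled graph with n nodes and m edges."""
    return 1 / comb(comb(n, 2), m)

rng = random.Random(0)
edges = sample_gnm(6, 4, rng)
print(edges, gnm_graph_probability(6, 4))
```

Unlike in $G(n, p)$, the number of edges is fixed here, so the edges are no longer
independent of each other.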


11.2.1.2 Discussion 


As a seminal work in random graph theory, the Erdős–Rényi model inspired
much subsequent work that studies and generalizes it. However, the assumptions
of this model, e.g., edges are independent and each edge is equally likely to
be generated, are too strong to capture the properties of real-world graphs. For
example, the degree distribution of the Erdős–Rényi model has an exponential tail,
which means we rarely see node degrees span a broad range, e.g., several orders
of magnitude. Meanwhile, real-world graphs/networks like the World Wide Web
(WWW) are believed to possess a degree distribution that follows a power law, i.e.,
$p(d) \propto d^{-\gamma}$ where $d$ is the degree and the exponent $\gamma$ is typically between 2 and
3. Essentially, this means that there are many nodes that have small node degrees,
whereas there are a few nodes which have extremely large node degrees (i.e., hubs) in
real-world graphs like the WWW. Therefore, many improved models like
scale-free networks (Barabási and Albert, 1999) were later proposed, which better fit
the degree distribution of real-world graphs.


11.2.2 Stochastic Block Model 


Stochastic block models (SBMs) are a family of random graphs with clusters of nodes
and are often employed as a canonical model for tasks like community detection
and clustering. The SBM was proposed independently in a few scientific communities, e.g.,
machine learning and statistics (Holland et al, 1983), theoretical computer science
(Bui et al, 1987), and mathematics (Bollobás et al, 2007). It is arguably the simplest
model of a graph with communities/clusters. As a generative model, the SBM can
provide ground-truth cluster memberships, which in turn can help benchmark and
understand different clustering/community detection algorithms. In the following,
we first introduce the basics of the model and then discuss its advantages as well as
limitations.


11.2.2.1 Model 


We start the introduction by denoting the total number of nodes as $n$ and the number
of communities/clusters as $k$. A prior probability vector $p$ over the $k$ clusters and
a $k \times k$ matrix $W$ with entries in $[0, 1]$ are also given. We generate a random graph
following the procedure below:


1. For each node, we generate its community label (an integer from $\{1, \ldots, k\}$) by
independently sampling from $p$.

2. For each pair of nodes, denoting their community labels as $i$ and $j$, we generate
an edge by independently sampling with probability $W_{i,j}$.


Basically, the community assignments of a pair of nodes determine the specific
entry of $W$ to be used, which in turn indicates how likely we connect this pair of nodes.
We denote such a model as SBM$(n, p, W)$. Note that, if we set $W_{i,j} = q$ for all
communities $(i, j)$, then the corresponding SBM degenerates to the Erdős–Rényi model
$G(n, q)$.
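
The two-step procedure above translates directly into code. In the sketch below,
community labels are 0-based (the text uses $1, \ldots, k$), `prior` plays the role of
$p$, and $W$ is a nested list; the function name `sample_sbm` is our own.

```python
import random
from itertools import combinations

def sample_sbm(n, prior, W, rng):
    """Sample from SBM(n, prior, W); returns (labels, edges)."""
    k = len(prior)
    # Step 1: draw a community label for every node independently from the prior.
    labels = rng.choices(range(k), weights=prior, k=n)
    # Step 2: connect each node pair with the probability given by their labels.
    edges = [
        (u, v)
        for u, v in combinations(range(n), 2)
        if rng.random() < W[labels[u]][labels[v]]
    ]
    return labels, edges

rng = random.Random(0)
# Two communities that are dense inside and sparse across.
labels, edges = sample_sbm(10, [0.5, 0.5], [[0.9, 0.1], [0.1, 0.9]], rng)
print(labels)
print(edges)
```

Setting every entry of `W` to the same value `q` makes the labels irrelevant for edge
generation, which recovers the $G(n, q)$ sampler.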

In the context of community detection, people are often interested in recovering
the community labels given a random graph drawn from the SBM. Denoting
the recovered and the ground-truth community labels as $X \in \mathbb{R}^{n \times 1}$ and $Y \in \mathbb{R}^{n \times 1}$,
we can define the agreement $R$ between two community labels as,

$$R(X, Y) = \max_{P \in \Pi} \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}[X_i = (PY)_i], \qquad (11.4)$$

where $P$ is a permutation matrix and $\Pi$ is the set of all permutation matrices. $X_i$ and
$(PY)_i$ are the i-th elements of $X$ and $PY$ respectively. In short, the agreement considers
the best possible reshuffle between two sequences of labels. Depending on the
requirement, we could examine the community detection algorithms in the sense
of exact recovery (i.e., cluster assignments are exactly recovered almost surely,
$p(R(X, Y) = 1) = 1$) or partial recovery (i.e., at most $1 - \epsilon$ fraction of nodes are
mislabeled almost surely, $p(R(X, Y) \geq \epsilon) = 1$). Researchers have established various
conditions under which a particular type of recovery is possible for SBM graphs.
For example, for SBMs with $W = \frac{\log(n) Q}{n}$ where $Q$ is a matrix with positive entries
and the same size as $W$, Abbe and Sandon (2015) show that exact recovery
is possible if and only if the minimum Chernoff-Hellinger divergence between any
two columns of $\text{diag}(p) Q$ is no less than 1, where $\text{diag}(p)$ is a diagonal matrix with
diagonal entries as $p$.
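
The agreement can be computed by brute force when the number of communities is small.
The sketch below interprets the "reshuffle" as a relabeling of the $k$ community ids
and maximizes the fraction of matching nodes over all $k!$ relabelings; this
interpretation, the 0-based labels, and the function name are assumptions made for
illustration.

```python
from itertools import permutations

def agreement(X, Y, k):
    """Best fraction of matching labels over all relabelings of Y's k community ids."""
    n = len(X)
    best = 0.0
    for relabel in permutations(range(k)):
        matches = sum(1 for x, y in zip(X, Y) if x == relabel[y])
        best = max(best, matches / n)
    return best

print(agreement([0, 0, 1, 1], [1, 1, 0, 0], k=2))  # same partition, swapped ids
print(agreement([0, 0, 1, 1], [0, 1, 0, 1], k=2))  # unrelated partition
print(agreement([0, 0, 0, 1], [1, 1, 1, 1], k=2))
```

The first call returns 1.0 because the two label vectors describe the same partition,
which is why the maximization over relabelings is needed before comparing labels.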


11.2.2.2 Discussion 


Abbe (2017) provides an up-to-date and comprehensive survey on the SBM and
the fundamental limits (from both information-theoretic and computational
perspectives) for community detection in the SBM. The SBM is a more realistic random
graph model for describing graphs with community structures compared to the
Erdős–Rényi model. It has also spawned many subsequent variants of block models like
the mixed membership SBM (Airoldi et al, 2008). However, the estimation of SBMs
on real-world graphs is hard since the number of communities is often unknown in
advance and some graphs may not exhibit clear community structures.


11.3 Deep Graph Generative Models 


In this section, we review several representative deep graph generative models which
aim at building probabilistic models of graphs using deep neural networks. Based
on the type of deep learning techniques being used, we can roughly divide the current
literature into three categories: variational auto-encoder (VAE) (Kingma and
Welling, 2014) based methods, deep auto-regressive (Van Oord et al, 2016) methods,
and generative adversarial network (GAN) (Goodfellow et al, 2014b) based
methods. We introduce all three model classes in the subsequent sections.
11.3.1 Representing Graphs


We first introduce how a graph is represented in the context of deep graph generative 
models. Suppose we are given a graph Y = (V, £) where ¥ is the set of nodes/ver- 
tices and & is the set of edges. Conditioning on a specific node ordering 7, we can 
represent the graph Y as an adjacency matrix Az where Az € R'”|*|”1, where |V| is 
the size of set Y (i.e., the number of nodes). The adjacency matrix not only provides 
a convenient representation of graphs on computers but also offers a natural way 
to mathematically define a probability distribution over graphs. Here we explicitly 
write the node ordering z in the subscript to emphasize that the rows and columns 
of A are arranged according to the 7. If we change the node ordering from 7 to 7’, 
the adjacency matrix will be permuted (shuffling rows and columns) accordingly, 
i.e., Aw = PA;P', where the permutation matrix P is constructed based on the pair 
of node orderings (z,2'). In other words, Az and A, represent the same graph Z. 
Therefore, a graph ¥ with an adjacency matrix Az can be equivalently represented 
as a set of adjacency matrices {PA;P' |P € IT} where IT is the set of all permutation 
matrices with size |V| x |¥|. Note that, depending on the symmetric structures of 
Az, there may exist two permutation matrices P , P> € IT so that P\Aq Al = PrAgPy ; 
Therefore, we remove such redundancies and keep those uniquely permutated ad- 
jacency matrices, denoted as </ = {PA;P'|P € Ig}. More precisely, Tg is the 
maximal subset of II so that Pi\AgP| Æ PrAgPy holds for any P|, P> € Hg. We 
add the subscript Y to emphasize that Ty depends on the given graph Z. Note that 
there exists a surjective mapping between IT and Hg. For the ease of notations, we 
will drop the subscript of the node ordering and use Y = æ% = {PAP'|P € Ty} to 
represent a graph from now on. 
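
As a small illustration of the set of permuted adjacency matrices, the following sketch enumerates all permutation matrices for a tiny graph and counts how many distinct matrices $PAP^\top$ they produce. The helper names `permutation_matrix` and `unique_permuted_adjacencies` are ours, not from the literature, and brute-force enumeration is only feasible for very small graphs.

```python
import itertools
import numpy as np

def permutation_matrix(order):
    """Return P with P[i, order[i]] = 1, so (P A P^T)[i, j] = A[order[i], order[j]]."""
    n = len(order)
    P = np.zeros((n, n), dtype=int)
    P[np.arange(n), list(order)] = 1
    return P

def unique_permuted_adjacencies(A):
    """Collect the distinct matrices P A P^T over all n! permutation matrices P."""
    n = A.shape[0]
    seen = set()
    for order in itertools.permutations(range(n)):
        P = permutation_matrix(order)
        seen.add((P @ A @ P.T).tobytes())
    return seen

# Path graph 0 - 1 - 2
A_path = np.array([[0, 1, 0],
                   [1, 0, 1],
                   [0, 1, 0]])
num_unique = len(unique_permuted_adjacencies(A_path))
print(num_unique, "distinct matrices out of 6 permutations")
```

For the path graph, swapping the two end nodes leaves the adjacency matrix unchanged, so the 6 permutation matrices yield only 3 distinct adjacency matrices.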

When considering the node features/attributes $X$, we can denote the graph-structured data as $\mathcal{G} = \{(P A P^\top, P X) \mid P \in \Pi_{\mathcal{G}}\}$¹. Note that the rows of $X$ are shuffled according to $P$ since each row of $X$ corresponds to a node. In our context, we can assume that the maximum number of nodes over all graphs is $n$. If a graph has fewer than $n$ nodes, we can add dummy nodes (e.g., with all-zero features) that are isolated from the other nodes to make its size equal to $n$. Therefore, $X \in \mathbb{R}^{n \times d_X}$ and $A \in \mathbb{R}^{n \times n}$, where $d_X$ is the feature dimension. To simplify the explanation, we do not include edge features, but it is straightforward to modify the following models to incorporate them.


11.3.2 Variational Auto-Encoder Methods 


Due to the great success of VAEs in image generation (Kingma and Welling, 2014; Rezende et al, 2014), it is natural to extend this framework to graph generation. This idea has been explored from different aspects (Kipf and Welling, 2016; Jin et al, 2018a; Simonovsky and Komodakis, 2018; Liu et al, 2018d; Ma et al, 2018; Grover et al, 2019; Liu et al, 2019b) and is often collectively named as GraphVAE. In the following, we first highlight the common framework shared by all these methods and then discuss some important variants.

¹ Technically, there may exist two permutation matrices $P_1, P_2 \in \Pi$ so that $P_1 A P_1^\top = P_2 A P_2^\top$ and $P_1 X \neq P_2 X$. It thus seems to be necessary to define $\mathcal{G} = \{(P A P^\top, P X) \mid P \in \Pi\}$. However, as seen later, we are always interested in distributions of node features that are exchangeable over nodes, i.e., $p(P_1 X) = p(P_2 X)$. Therefore, restricting ourselves to $\Pi_{\mathcal{G}}$ is sufficient for our exposition.


11.3.2.1 The GraphVAE Family 


Similar to vanilla VAEs, every model instance within the GraphVAE family consists of an encoder (i.e., a variational distribution $q_\phi(Z|A,X)$ parameterized by $\phi$), a decoder (i.e., a conditional distribution $p_\theta(\mathcal{G}|Z)$ parameterized by $\theta$), and a prior distribution (i.e., a distribution $p(Z)$ typically with fixed parameters). Before introducing individual components, we first describe what the latent variables $Z$ are. In the context of graph generation, we typically assume that each node is associated with a latent vector. Denoting the latent vector of the $i$-th node as $z_i$, $Z \in \mathbb{R}^{n \times d_Z}$ is obtained by stacking $\{z_i\}$ as row vectors. Such latent vectors should summarize the information of the local subgraphs associated with individual nodes so that we can decode/generate edges based on them. In other words, any pair of latent vectors $(z_i, z_j)$ is supposed to be informative enough to determine whether nodes $(i, j)$ should be connected. We could further introduce edge latent variables $\{z_{ij}\}$ to enrich the model. Again, we do not consider such an option for simplicity since the underlying modeling technique is roughly the same.

Encoder. We first explain how to construct the encoder using a deep neural network. Recall that the input to the encoder is the graph data $(A, X)$. The natural candidate to deal with such data is a graph neural network, e.g., a graph convolutional network (GCN) (Kipf and Welling, 2017b). For example, let us consider a two-layer GCN as below,

$$H = \tilde{A}\, \sigma(\tilde{A} X W_1)\, W_2, \tag{11.5}$$

where $H \in \mathbb{R}^{n \times d_H}$ are the node representations (each node is associated with a size-$d_H$ row vector). $\tilde{A} = D^{-\frac{1}{2}} (A + I) D^{-\frac{1}{2}}$ where $D$ is the degree matrix (i.e., a diagonal matrix whose entries are the row sums of $A + I$). $I$ is the identity matrix. $\sigma$ is the nonlinearity, which is often chosen to be the rectified linear unit (ReLU) (Nair and Hinton, 2010). $\{W_1, W_2\}$ are the learnable parameters. We can pad a constant to the input feature dimension so that the bias term is absorbed into the weight matrix. We adopt this convention for ease of notation.
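
To make Eq. (11.5) concrete, here is a minimal NumPy sketch of the two-layer GCN. The sizes, the random weights, and the function names are illustrative assumptions; the last lines check numerically that relabeling the nodes only permutes the rows of $H$.

```python
import numpy as np

def normalized_adjacency(A):
    """Compute D^{-1/2} (A + I) D^{-1/2}, with D the degree matrix of A + I."""
    A_hat = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    return d_inv_sqrt[:, None] * A_hat * d_inv_sqrt[None, :]

def gcn_two_layer(A, X, W1, W2):
    """Eq. (11.5): H = A_tilde @ relu(A_tilde @ X @ W1) @ W2."""
    A_tilde = normalized_adjacency(A)
    hidden = np.maximum(A_tilde @ X @ W1, 0.0)  # ReLU nonlinearity
    return A_tilde @ hidden @ W2

rng = np.random.default_rng(0)
n, d_x, d_hidden, d_h = 4, 3, 8, 2              # illustrative sizes
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 1],
              [0, 1, 0, 0],
              [0, 1, 0, 0]], dtype=float)
X = rng.normal(size=(n, d_x))
W1 = rng.normal(size=(d_x, d_hidden))
W2 = rng.normal(size=(d_hidden, d_h))
H = gcn_two_layer(A, X, W1, W2)

# Permutation equivariance: relabeling the nodes permutes the rows of H.
P = np.eye(n)[[2, 0, 3, 1]]
H_perm = gcn_two_layer(P @ A @ P.T, P @ X, W1, W2)
print(np.allclose(H_perm, P @ H))
```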

Relying on the learned node representations $H$, we can construct the variational distribution as below,

$$q_\phi(Z|A,X) = \prod_{i=1}^{n} q(z_i|A,X) \tag{11.6}$$

$$q(z_i|A,X) = \mathcal{N}(\mu_i, \sigma_i I) \tag{11.7}$$

$$\mu = \text{MLP}_\mu(H) \tag{11.8}$$

$$\log \sigma = \text{MLP}_\sigma(H). \tag{11.9}$$

Here we typically assume that the variational distribution $q_\phi(Z|A,X)$ is conditionally node-wise independent for tractability. $\mu_i$ and $\sigma_i$ are the $i$-th rows of $\mu$ and $\sigma$ respectively. The learnable parameters $\phi$ consist of all parameters of the two multi-layer perceptrons (MLPs) and the above GCN. Although the approximate variational distribution defined in Eq. (11.6) is simple, it has a few useful properties. First, the probability distribution is invariant w.r.t. the permutation of nodes. Mathematically, it means that given two different permutation matrices $P_1, P_2 \in \Pi$, we have

$$q(P_1 Z \mid P_1 A P_1^\top, P_1 X) = q(P_2 Z \mid P_2 A P_2^\top, P_2 X). \tag{11.10}$$


This can be easily verified from the exchangeability of the product of probabilities and the equivariance property of graph neural networks. Second, the neural networks underlying each Gaussian (i.e., "GNN + MLP") are very powerful so that the conditional distributions are expressive in capturing the uncertainty of latent variables. Third, this encoder is computationally cheaper than those which consider the dependencies among different $\{z_i\}$ (e.g., an autoregressive encoder). It thus provides a solid baseline for investigating whether a more powerful encoder is needed in a given problem.

Prior. Similar to most VAEs, GraphVAEs often adopt a prior that is fixed during learning. For example, a common choice is a node-independent Gaussian as below,

$$p(Z) = \prod_{i=1}^{n} p(z_i) \tag{11.11}$$

$$p(z_i) = \mathcal{N}(0, I). \tag{11.12}$$

Again, we could replace this fixed prior with more powerful ones like an autoregressive model at the cost of more computation and/or a time-consuming pre-training stage. But this prior serves as a good starting point to benchmark more complicated alternatives, e.g., the normalizing flow based one in (Liu et al, 2019b).

Decoder. The aim of a decoder in graph generative models is to construct a probability distribution over the graph and its features/attributes conditioned on the latent variables, i.e., $p(\mathcal{G}|Z)$. However, as we discussed previously, we need to consider all possible node orderings (each corresponds to a permuted adjacency matrix) which leave the graph unchanged, i.e.,

$$p(\mathcal{G}|Z) = \sum_{P \in \Pi_{\mathcal{G}}} p(P A P^\top, P X \mid Z). \tag{11.13}$$
Recall that $\Pi_{\mathcal{G}}$ is the maximal subset of the set of all possible permutation matrices $\Pi$ so that $P_1 A_\pi P_1^\top \neq P_2 A_\pi P_2^\top$ holds for any $P_1, P_2 \in \Pi_{\mathcal{G}}$. To build such a decoder, we first construct a probability distribution over the adjacency matrix and the node feature matrix. For example, we show a popular and simple construction (Kipf and Welling, 2016) as below,

$$p(A, X|Z) = \prod_{i,j} p(A_{ij}|Z) \prod_{i=1}^{n} p(x_i|Z) \tag{11.14}$$

$$p(A_{ij}|Z) = \text{Bernoulli}(\Theta_{ij}) \tag{11.15}$$

$$p(x_i|Z) = \mathcal{N}(\tilde{\mu}_i, \tilde{\sigma}_i I) \tag{11.16}$$

$$\Theta_{ij} = \text{MLP}_\Theta([z_i \| z_j]) \tag{11.17}$$

$$\tilde{\mu}_i = \text{MLP}_{\tilde{\mu}}(z_i) \tag{11.18}$$

$$\tilde{\sigma}_i = \text{MLP}_{\tilde{\sigma}}(z_i), \tag{11.19}$$

where we adopt an edge-independent Bernoulli distribution over edges and a node-wise independent Gaussian distribution over node features. $[z_i \| z_j]$ means concatenating $z_i$ and $z_j$. $x_i$ is the $i$-th row of the node feature matrix $X$. The first product term in Eq. (11.14) runs over all $n^2$ possible edges. The learnable parameters consist of those of the three MLPs. This decoder is simple yet powerful. However, given the latent variables $Z$, the decoder is not permutation invariant in general, i.e., for two different permutation matrices $P_1$ and $P_2$, we generally have

$$p(P_1 A P_1^\top, P_1 X \mid Z) \neq p(P_2 A P_2^\top, P_2 X \mid Z). \tag{11.20}$$

Note that there are corner cases in which $p(P_1 A P_1^\top, P_1 X \mid Z) = p(P_2 A P_2^\top, P_2 X \mid Z)$ holds. For example, if an adjacency matrix $A$ has certain symmetries, there could exist a pair $(P_1, P_2)$ so that $P_1 A P_1^\top = P_2 A P_2^\top$. But this does not hold for all pairs $(P_1, P_2)$. As a second example, if all $\Theta_{ij}$ are the same for all $(i, j)$, all $\tilde{\mu}_i$ are the same for all $i$, and all $\tilde{\sigma}_i$ are the same for all $i$, then for any two permutation matrices $(P_1, P_2)$, we have $p(P_1 A P_1^\top, P_1 X \mid Z) = p(P_2 A P_2^\top, P_2 X \mid Z)$. Nevertheless, these two cases rarely happen in practice.
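
The lack of permutation invariance can be checked numerically. The sketch below evaluates only the edge term of Eq. (11.14) and, as a simplifying assumption, replaces $\text{MLP}_\Theta$ with a single linear layer followed by a sigmoid; the weights `w`, `b` and the latent vectors are random placeholders.

```python
import numpy as np

def edge_probabilities(Z, w, b):
    """Theta[i, j] = sigmoid(w . [z_i || z_j] + b), a one-layer stand-in for MLP_Theta."""
    n = Z.shape[0]
    # Row i * n + j of `pairs` holds the concatenation [z_i || z_j].
    pairs = np.concatenate([np.repeat(Z, n, axis=0), np.tile(Z, (n, 1))], axis=1)
    return (1.0 / (1.0 + np.exp(-(pairs @ w + b)))).reshape(n, n)

def log_p_adjacency(A, Theta):
    """Sum of the n^2 edge-independent Bernoulli log-probabilities, Eq. (11.15)."""
    return float(np.sum(A * np.log(Theta) + (1 - A) * np.log(1 - Theta)))

rng = np.random.default_rng(0)
n, d_z = 3, 2
Z = rng.normal(size=(n, d_z))
w, b = rng.normal(size=2 * d_z), 0.1
Theta = edge_probabilities(Z, w, b)

A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)  # path 0 - 1 - 2
P = np.eye(n)[[1, 0, 2]]                                      # swap nodes 0 and 1
A_perm = P @ A @ P.T                                          # same graph, new ordering

ll_original = log_p_adjacency(A, Theta)
ll_permuted = log_p_adjacency(A_perm, Theta)
print(ll_original, ll_permuted)
```

Both matrices describe the same graph, yet for fixed $Z$ the decoder generally assigns them different likelihoods, which is why the sum in Eq. (11.13) is needed.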

Equipped with the distribution in Eq. (11.14), we can evaluate the terms on the right-hand side of Eq. (11.13). However, the number of permutation matrices in $\Pi_{\mathcal{G}}$ can be as large as $n!$, which makes the exact evaluation computationally prohibitive. There are a few ways in the literature to approximate it. For example, we can just use the maximum term as below,

$$p(\mathcal{G}|Z) = \sum_{P \in \Pi_{\mathcal{G}}} p(P A P^\top, P X \mid Z) \approx \max_{P \in \Pi_{\mathcal{G}}} p(P A P^\top, P X \mid Z). \tag{11.21}$$

Unfortunately, this maximization problem can be interpreted as an integer quadratic program, which is itself a hard optimization problem. To approximately solve the matching problem, Simonovsky and Komodakis (2018) exploit a relaxed max-pooling matching solver (Cho et al, 2014b). On the other hand, there are some canonical node orderings in certain applications. For example, the simplified molecular-input line-entry system (SMILES) string (Weininger, 1988) provides a sequential ordering of atoms (nodes) of molecule graphs in chemistry. Based on the canonical node ordering, we can construct the corresponding permutation matrix $\tilde{P}$ and simply approximate the conditional probability as,

$$p(\mathcal{G}|Z) = \sum_{P \in \Pi_{\mathcal{G}}} p(P A P^\top, P X \mid Z) \approx p(\tilde{P} A \tilde{P}^\top, \tilde{P} X \mid Z). \tag{11.22}$$

Objective. The training objective of GraphVAE is similar to that of regular VAEs, i.e., the evidence lower bound (ELBO),

$$\mathbb{E}_{q_\phi(Z|A,X)}\left[\log p_\theta(\mathcal{G}|Z)\right] - \text{KL}\left(q_\phi(Z|A,X) \,\|\, p(Z)\right). \tag{11.23}$$

To learn the encoder and the decoder, we need to sample from the encoder to approximate the expectation in Eq. (11.23) and leverage the reparameterization trick (Kingma and Welling, 2014) to back-propagate the gradient.
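
A minimal NumPy sketch of a Monte Carlo ELBO estimate follows. It assumes a diagonal Gaussian encoder in which $\sigma$ is read as a per-dimension standard deviation, and it uses a placeholder `toy_log_likelihood` in place of a real decoder; both are our assumptions. NumPy does not compute gradients, so the sketch only shows the forward computation that an automatic-differentiation framework would differentiate.

```python
import numpy as np

def sample_latents(mu, log_sigma, rng):
    """Reparameterization: Z = mu + sigma * eps with eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(log_sigma) * eps

def kl_to_standard_normal(mu, log_sigma):
    """Closed-form KL(N(mu, diag(sigma^2)) || N(0, I)), summed over nodes and dimensions."""
    sigma_sq = np.exp(2.0 * log_sigma)
    return float(0.5 * np.sum(sigma_sq + mu ** 2 - 1.0 - 2.0 * log_sigma))

def elbo_estimate(mu, log_sigma, log_likelihood_fn, rng, num_samples=1):
    """Monte Carlo estimate of Eq. (11.23): E_q[log p(G|Z)] - KL(q || p)."""
    recon = np.mean([log_likelihood_fn(sample_latents(mu, log_sigma, rng))
                     for _ in range(num_samples)])
    return float(recon - kl_to_standard_normal(mu, log_sigma))

rng = np.random.default_rng(0)
mu = rng.normal(size=(4, 2))              # illustrative encoder outputs, n = 4, d_Z = 2
log_sigma = -0.5 * np.ones((4, 2))
# Placeholder for log p(G | Z); a real model would use the decoder of Eq. (11.14).
toy_log_likelihood = lambda Z: -0.5 * float(np.sum(Z ** 2))
elbo = elbo_estimate(mu, log_sigma, toy_log_likelihood, rng, num_samples=8)
print(elbo)
```

Because the noise `eps` is sampled independently of `mu` and `log_sigma`, the sample `Z` is a deterministic, differentiable function of the encoder outputs, which is the point of the reparameterization trick.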


11.3.2.2 Hierarchical and Constrained GraphVAEs 


There are many variants derived from the GraphVAE family mentioned above. We now briefly introduce two important types of variants, i.e., hierarchical GraphVAEs (Jin et al, 2018a) and constrained GraphVAEs (Liu et al, 2018d; Ma et al, 2018).

Hierarchical GraphVAEs. One representative work of hierarchical GraphVAEs is Junction Tree VAEs (Jin et al, 2018a), which aim at modeling molecule graphs. The key idea is to build a GraphVAE relying on the hierarchical graph representations of molecules. In particular, we first apply tree decomposition to obtain a junction tree $\mathcal{T}$ from the original molecule graph $\mathcal{G}$. A junction tree is a cluster tree (each node is a set of one or more variables of the original graph) with the running intersection property (Barber, 2004). It provides a coarsened representation of the original graph since one node in a junction tree may correspond to a subgraph with several nodes in the original graph. As shown in Figure 11.1, there are two graphs corresponding to two levels, i.e., the original molecule graph $\mathcal{G}$ (1st level) and the decomposed junction tree $\mathcal{T}$ (2nd level). Since we can efficiently perform tree decomposition to obtain the junction tree, the tree itself is not a latent variable. Jin et al (2018a) propose to use Gated Graph Neural Networks (GGNNs) (Li et al, 2016b) as encoders (one for each level) and construct variational posteriors $q(Z_{\mathcal{G}}|\mathcal{G})$ and $q(Z_{\mathcal{T}}|\mathcal{T})$ as Gaussians. To decode the molecule graph, we need to perform a two-level generation process conditioned on the sampled latent variables $Z_{\mathcal{T}}$ and $Z_{\mathcal{G}}$. A junction tree is first generated by an autoregressive decoder which is again based on GGNNs. Conditioned on the generated tree, Jin et al (2018a) resort to a maximum a posteriori (MAP) formulation to generate the final molecule graph, i.e., finding the compatible subgraphs at each node of the tree so that the overall score (log-likelihood) of the resultant graph (i.e., replacing each node in the tree with the chosen subgraph) is maximized. The whole model can be learned similarly to other GraphVAEs. This model provides an interesting extension of GraphVAEs to hierarchical graph generation and demonstrates strong empirical performance. There are other important application-dependent details which greatly improve efficiency. For example, we can build a dictionary of chemically valid subgraphs so that each generation step in the 2nd level decoding generates a subgraph rather than a single node. Nevertheless, the model design largely relies on the efficiency of the chosen junction tree algorithm and certain application-dependent properties. It is unclear how well this model performs on general graphs other than molecules.

Fig. 11.1: Junction Tree VAEs. The junction tree corresponding to the molecule graph is obtained via the tree decomposition as shown in the top-right. A node/cluster in the junction tree (color-shaded) may correspond to a subgraph in the original molecule graph. Two GNN-based encoders are applied to the molecular graph and junction tree respectively to construct the variational posterior distributions over latent variables $Z_{\mathcal{G}}$ and $Z_{\mathcal{T}}$. During the generation, we first generate the junction tree using an autoregressive decoder and then obtain the final molecule graph via approximately solving a maximum a posteriori problem. Adapted from Figure 3 of (Jin et al, 2018a).

Constrained GraphVAEs. In many applications of deep graph generative models, certain constraints on the generated graphs are preferred. For example, while generating molecule graphs, the configuration of chemical bonds (edges) must meet the valence criteria of the atoms (nodes). How to ensure that the generated graphs satisfy such constraints is a challenging problem. There are generally two types of approaches to overcome it in the context of GraphVAEs. The first type is to design a decoder so that all generated graphs satisfy the constraints by construction. For example, an autoregressive decoder is often adopted as in (Liu et al, 2018d; Dai et al, 2018b). At each step, conditioned on the currently generated graph, the model generates a new node, a new edge, and the node/edge attributes following certain rules, i.e., ruling out invalid options (those that would violate the constraints), similar to what GrammarVAEs (Kusner et al, 2017) do. The other type of approach is to treat the constraints softly. Similar to how constrained optimization problems are converted to unconstrained ones by adding Lagrangians, Ma et al (2018) propose Lagrangian-based regularizers to incorporate constraints like the valence constraint for molecule graphs, the connectivity constraint, and node compatibility. The benefit of such methods is that the generation could be much simpler and more efficient since we do not need a slow autoregressive decoder. Also, the regularization is only applied during learning and does not bring any overhead in the generation. Of course, the downside is that the generated graph may not exactly satisfy all constraints since the regularization only acts softly in the optimization.
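
The first, hard-constraint approach can be sketched as masking invalid options before sampling. The example below is a generic illustration and not the decoder of any cited paper: the valence table `MAX_VALENCE`, the bond-order encoding, and the function `masked_bond_distribution` are hypothetical.

```python
import numpy as np

# Hypothetical maximum valences; a real model would take these from chemistry rules.
MAX_VALENCE = {"C": 4, "N": 3, "O": 2}
BOND_ORDERS = np.array([0, 1, 2, 3])  # 0 means "no bond"

def masked_bond_distribution(logits, remaining_u, remaining_v):
    """Softmax over bond orders after ruling out those exceeding either atom's remaining valence."""
    valid = BOND_ORDERS <= min(remaining_u, remaining_v)
    masked = np.where(valid, logits, -np.inf)
    masked = masked - masked.max()        # numerical stability; "no bond" is always valid
    probs = np.exp(masked)                # exp(-inf) = 0 removes invalid options
    return probs / probs.sum()

logits = np.array([0.2, 1.0, 0.5, 2.0])   # illustrative decoder scores
# An oxygen that already has one bond (remaining valence 1) meets a fresh carbon (remaining 4).
probs = masked_bond_distribution(logits, MAX_VALENCE["O"] - 1, MAX_VALENCE["C"])
print(probs)
```

Double and triple bonds receive zero probability here, so any sample drawn from `probs` respects the valence rule by construction.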


11.3.3 Deep Autoregressive Methods 


Deep autoregressive models like PixelRNNs (Van Oord et al, 2016) and PixelCNNs 
(Oord et al, 2016) have achieved tremendous successes in image modeling. There- 
fore, it is natural to generalize this type of method to graphs. The shared underlying 
idea of these autoregressive models is to characterize the graph generation process 
as a sequential decision-making process and make a new decision at each step conditioning on all previously made decisions. For example, as shown in Figure 11.2, we can first decide whether to add a new node, then decide whether to add a new edge, and so on. If node/edge labels are considered, we can further sample
from a categorical distribution at each step to specify such labels. The key question 
of this class of methods is how to build a probabilistic model so that our current 
decision depends on all previous historical choices. 


11.3.3.1 GNN-based Autoregressive Model 


The first GNN-based autoregressive model was proposed in (Li et al, 2018d), and its high-level idea is exactly the one shown in Figure 11.2. Suppose at time step $t-1$ we have already generated a partial graph denoted as $\mathcal{G}^{t-1} = (\mathcal{V}^{t-1}, \mathcal{E}^{t-1})$. The corresponding adjacency matrix and node feature matrix are denoted as $(A^{t-1}, X^{t-1})$. At time step $t$, the model needs to decide: 1) whether to add a new node or stop the generation (denoting the probability as $p_{\text{AddNode}}$); 2) whether to add an edge that links an existing node to the newly added node (denoting the probability as $p_{\text{AddEdge}}$); 3) which existing node to link to the newly added node (denoting the probability as $p_{\text{Nodes}}$). For simplicity, we define $p_{\text{AddNode}}$ to be a Bernoulli distribution. We can extend it to a categorical one if node labels/types are considered. $p_{\text{AddEdge}}$ is another Bernoulli distribution whereas $p_{\text{Nodes}}$ is a categorical distribution with size $|\mathcal{V}^{t-1}|$ (i.e., its size changes as the generation goes on).

Fig. 11.2: The overview of the deep graph generative model in (Li et al, 2018d). The graph generation is formulated as a sequential decision-making process. At each step of the generation, the model needs to decide: 1) whether to add a new node or stop the whole generation; 2) whether to add a new edge (one end connected to the new node) or not; 3) which existing node to connect to with the new edge. Adapted from Figure 1 of (Li et al, 2018d).

Message Passing Graph Neural Networks. To construct the above probabilities of decisions, we first build a message passing graph neural network (Scarselli et al, 2008; Li et al, 2016b; Gilmer et al, 2017) to learn node representations. The input to the GNN at time step $t-1$ is $(A^{t-1}, H^{t-1})$ where $H^{t-1}$ is the node representation (one row corresponds to a node). Note that at time 0, since the graph is empty, we need to generate a new node to start. The generation probability $p_{\text{AddNode}}$ will be output by the model based on some randomly initialized hidden state. If we model the node labels/types or node features, we can also use them as additional node representations, e.g., concatenating them with rows of $H^{t-1}$.

The one-step message passing is shown as below,

$$m_{ij} = f_{\text{Msg}}(h_i^{t-1}, h_j^{t-1}) \quad \forall (i,j) \in \mathcal{E} \tag{11.24}$$

$$m_i = f_{\text{Agg}}(\{m_{ij} \mid \forall j \in \Omega_i\}) \quad \forall i \in \mathcal{V} \tag{11.25}$$

$$\tilde{h}_i^{t-1} = f_{\text{Update}}(h_i^{t-1}, m_i) \quad \forall i \in \mathcal{V}, \tag{11.26}$$

where $f_{\text{Msg}}$, $f_{\text{Agg}}$, and $f_{\text{Update}}$ are the message function, the aggregation function, and the node update function respectively. For the message function, we often instantiate $f_{\text{Msg}}$ as an MLP. Note that if edge features are considered, one can incorporate them as input to $f_{\text{Msg}}$. $f_{\text{Agg}}$ could simply be an average or summation operator. Typical examples of $f_{\text{Update}}$ include gated recurrent units (GRUs) (Cho et al, 2014a) and long short-term memory (LSTM) (Hochreiter and Schmidhuber, 1997). $h_i^{t-1}$ is the input node representation at time step $t-1$. $\Omega_i$ denotes the set of neighboring nodes of the node $i$. $\tilde{h}_i^{t-1}$ is the updated node representation which serves as the input node representation for the next message passing step. The above message passing process is typically executed for a fixed number of steps, which is tuned as a hyperparameter. Note that the generation step $t$ is different from the message passing step (we deliberately omit its notation to avoid confusion).
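
One round of Eqs. (11.24)-(11.26) can be sketched in NumPy as follows. To keep the example short we assume a one-layer message function, sum aggregation, and a plain tanh layer standing in for the GRU/LSTM update; the weight matrices are random placeholders.

```python
import numpy as np

def message_passing_step(H, edges, W_msg, W_upd):
    """One round of Eqs. (11.24)-(11.26) with sum aggregation.

    H: (n, d) node representations; edges: directed pairs (i, j), meaning j sends to i.
    """
    n, d = H.shape
    M = np.zeros((n, W_msg.shape[1]))
    for i, j in edges:
        m_ij = np.tanh(np.concatenate([H[i], H[j]]) @ W_msg)  # f_Msg: one-layer MLP
        M[i] += m_ij                                          # f_Agg: summation over neighbors
    # f_Update: a plain tanh layer standing in for a GRU/LSTM cell
    return np.tanh(np.concatenate([H, M], axis=1) @ W_upd)

rng = np.random.default_rng(0)
n, d, d_msg = 3, 4, 5
H = rng.normal(size=(n, d))
edges = [(0, 1), (1, 0), (1, 2), (2, 1)]       # undirected path 0 - 1 - 2, both directions
W_msg = rng.normal(size=(2 * d, d_msg))
W_upd = rng.normal(size=(d + d_msg, d))
for _ in range(2):                              # the number of rounds is a hyperparameter
    H = message_passing_step(H, edges, W_msg, W_upd)
print(H.shape)
```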

Output Probabilities. After the message passing process is done, we obtain the new node representations $H^t$. Now we can construct the aforementioned output probabilities as follows,

$$h_{\mathcal{G}^{t-1}} = f_{\text{Readout}}(H^t) \tag{11.27}$$

$$p_{\text{AddNode}} = \text{Bernoulli}\left(\sigma\left(\text{MLP}_{\text{AddNode}}(h_{\mathcal{G}^{t-1}})\right)\right) \tag{11.28}$$

$$p_{\text{AddEdge}} = \text{Bernoulli}\left(\sigma\left(\text{MLP}_{\text{AddEdge}}(h_{\mathcal{G}^{t-1}}, h_v)\right)\right) \tag{11.29}$$

$$s_{uv} = \text{MLP}_{\text{Nodes}}(h_u^t, h_v) \quad \forall u \in \mathcal{V}^{t-1} \tag{11.30}$$

$$p_{\text{Nodes}} = \text{Categorical}(\text{softmax}(s)). \tag{11.31}$$

Here we first summarize the graph representation $h_{\mathcal{G}^{t-1}}$ (a vector) by reading out from the node representation $H^t$ via $f_{\text{Readout}}$, which could be an average operator or an attention-based one. Based on $h_{\mathcal{G}^{t-1}}$, we predict the probability of adding a new node $p_{\text{AddNode}}$, where $\sigma$ is the sigmoid function. If we decide to add a new node by sampling 1 from the Bernoulli distribution $p_{\text{AddNode}}$, we denote the new node as $v$. We can initialize its representation $h_v$ as random features by sampling either from $\mathcal{N}(0, I)$ or from a learned distribution over node types/labels if provided. Then we compute similarity scores between every existing node $u$ in $\mathcal{G}^{t-1}$ and $v$ as $s_{uv}$. $s$ is the concatenated vector of all similarity scores. Finally, we normalize the scores using softmax to form the categorical distribution from which we sample an existing node to obtain the new edge. By sampling from all these probabilities, we could either stop the generation or obtain a new graph with a new node and/or a new edge. We repeat this procedure by carrying on the node representations along with the generated graphs until the model generates a stop signal from $p_{\text{AddNode}}$.
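
The three output heads can be sketched as below. As simplifying assumptions, the readout is a plain average, the two Bernoulli heads are single linear layers, and $\text{MLP}_{\text{Nodes}}$ is replaced by a bilinear score; all weights are random placeholders.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

def output_probabilities(H, h_v, w_node, w_edge, W_score):
    """Return (p_add_node, p_add_edge, p_nodes) in the spirit of Eqs. (11.27)-(11.31)."""
    h_graph = H.mean(axis=0)                                        # f_Readout: average
    p_add_node = sigmoid(h_graph @ w_node)                          # Eq. (11.28)
    p_add_edge = sigmoid(np.concatenate([h_graph, h_v]) @ w_edge)   # Eq. (11.29)
    scores = H @ W_score @ h_v              # s_uv, a bilinear stand-in for MLP_Nodes
    return p_add_node, p_add_edge, softmax(scores)                  # Eq. (11.31)

rng = np.random.default_rng(0)
n_existing, d = 4, 3
H = rng.normal(size=(n_existing, d))   # existing node representations after message passing
h_v = rng.normal(size=d)               # randomly initialized representation of the new node v
p_add_node, p_add_edge, p_nodes = output_probabilities(
    H, h_v, rng.normal(size=d), rng.normal(size=2 * d), rng.normal(size=(d, d)))

add_node = rng.random() < p_add_node            # sample the Bernoulli decision
target = rng.choice(n_existing, p=p_nodes)      # sample which existing node to link
print(p_add_node, p_add_edge, p_nodes)
```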

Training. To train the model, we need to maximize the likelihood of the observed graphs. Recall that we need to consider the permutations that leave the graph unchanged as discussed in Section 11.3.2.1. For simplicity, we focus on the adjacency matrix alone following (Li et al, 2018d), i.e., $\mathcal{G} \equiv \{P A P^\top \mid P \in \Pi_{\mathcal{G}}\}$, where $\Pi_{\mathcal{G}}$ is the maximal subset of $\Pi$ so that $P_1 A P_1^\top \neq P_2 A P_2^\top$ holds for any $P_1, P_2 \in \Pi_{\mathcal{G}}$. The ideal objective is to maximize the following,

$$\max \log p(\mathcal{G}) = \max \log \left( \sum_{P \in \Pi_{\mathcal{G}}} p(P A P^\top) \right). \tag{11.32}$$

Here we omit the variables being optimized, i.e., parameters of models defined in Eq. (11.24) and Eq. (11.27). Note that given a node ordering (corresponding to one specific permutation matrix $P$), we have a bijection between a sequence of correct decisions and an adjacency matrix. In other words, we can equivalently write $p(P A P^\top)$ as a product of probabilities that are explained in Eq. (11.27). However, the marginalization inside the logarithmic function on the right-hand side is intractable due to the nearly factorial size of $\Pi_{\mathcal{G}}$ in practice. Li et al (2018d) propose to randomly sample a few different node orderings as $\tilde{\Pi}_{\mathcal{G}}$ and train the model with the following approximated objective,

$$\max \log \left( \sum_{P \in \tilde{\Pi}_{\mathcal{G}}} p(P A P^\top) \right). \tag{11.33}$$

Note that this objective is a strict lower bound of the one in Eq. (11.32). If canonical node orderings like the SMILES ordering for molecule graphs are available, we can also use that to compute the above objective.
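
The relation between Eq. (11.32) and Eq. (11.33) can be illustrated numerically. In the sketch below, the per-ordering log-probabilities are random placeholders rather than outputs of a trained model, and the sum inside the logarithm is evaluated stably with `np.logaddexp`.

```python
import numpy as np

def log_marginal(log_probs):
    """log of a sum of probabilities, computed stably from per-ordering log-probabilities."""
    return float(np.logaddexp.reduce(log_probs))

rng = np.random.default_rng(0)
# Placeholder values of log p(P A P^T) for 24 orderings; a real model would compute these.
log_p_all = rng.normal(loc=-20.0, scale=2.0, size=24)
sampled = rng.choice(24, size=4, replace=False)     # a few randomly sampled orderings

full_objective = log_marginal(log_p_all)            # Eq. (11.32)
approx_objective = log_marginal(log_p_all[sampled]) # Eq. (11.33)
print(approx_objective <= full_objective)
```

Every omitted ordering contributes a positive probability to the full sum, so the sampled objective can never exceed the full one.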

Discussion. This model formulates the graph generation as a sequential decision-making process and provides a GNN-based autoregressive model to construct probabilities of possible decisions at each step. The overall model design is well-motivated. It also achieves good empirical performance in generating small graphs like molecules (e.g., fewer than 40 nodes). However, since the model only generates at most one new node and one new edge per step, the total number of generation steps scales quadratically with the number of nodes for dense graphs. It is thus inefficient for generating moderately large graphs (e.g., with a few hundred nodes).


11.3.3.2 Graph Recurrent Neural Networks (GraphRNN) 


Graph Recurrent Neural Networks (GraphRNN) (You et al, 2018b) is another deep autoregressive model which has a similar sequential decision-making formulation and leverages RNNs to construct the conditional probabilities. We again rely on the adjacency matrix representation of a graph, i.e., $\mathcal{G} \equiv \{P A P^\top \mid P \in \Pi_{\mathcal{G}}\}$. Before dealing with the permutations, let us assume the node ordering is given so that $P = I$.

A Simple Variant of GraphRNN. GraphRNN starts with an autoregressive decomposition of the probability of an adjacency matrix as follows,

$$p(A) = \prod_{t=1}^{n} p(A_t \mid A_{<t}), \tag{11.34}$$

where $A_t$ is the $t$-th column of the adjacency matrix $A$ and $A_{<t}$ is a matrix formed by columns $A_1, A_2, \dots, A_{t-1}$. $n$ is the maximum number of nodes. If a graph has fewer than $n$ nodes, we pad dummy nodes as discussed in Section 11.3.1. Then we can construct the conditional probability as an edge-independent Bernoulli distribution,

$$p(A_t \mid A_{<t}) = \text{Bernoulli}(\Theta_t) = \prod_{i=1}^{n} \Theta_{t,i}^{\mathbb{1}[A_{i,t}=1]} (1 - \Theta_{t,i})^{\mathbb{1}[A_{i,t}=0]} \tag{11.35}$$

$$\Theta_t = f_{\text{out}}(h_t) \tag{11.36}$$

$$h_t = f_{\text{trans}}(h_{t-1}, A_{t-1}), \tag{11.37}$$

where $\Theta_t$ is a size-$n$ vector of Bernoulli parameters. $\Theta_{t,i}$ denotes its $i$-th element. $A_{i,t}$ denotes the $i$-th element of the column vector $A_t$. $f_{\text{out}}$ could be an MLP which takes the hidden state $h_t$ as input and outputs $\Theta_t$. $f_{\text{trans}}$ is the RNN cell function which takes the $(t-1)$-th column of the adjacency matrix $A_{t-1}$ and the hidden state $h_{t-1}$ as input and outputs the current hidden state $h_t$. We can use an LSTM or GRU as the RNN cell function. Note that the conditioning on $A_{<t}$ is implemented via the recurrent use of the hidden state in an RNN. The hidden state can be initialized as zeros or randomly sampled from a standard normal distribution. This model variant is very simple and can be easily implemented since it only consists of a few common neural network modules, i.e., an RNN and an MLP.
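
The sampling loop implied by Eqs. (11.34)-(11.37) can be sketched as follows. This is not the reference implementation: we assume a plain tanh RNN cell instead of an LSTM/GRU, a single linear layer for $f_{\text{out}}$, an all-zero column as the input at the first step, and untrained random weights.

```python
import numpy as np

def sample_graphrnn_simple(n, d_h, params, rng):
    """Sample an n x n binary matrix column by column, following Eqs. (11.34)-(11.37)."""
    W_h, W_a, W_out = params
    A = np.zeros((n, n), dtype=int)
    h = np.zeros(d_h)                 # initial hidden state h_0
    prev_col = np.zeros(n)            # stands in for A_0, which the text leaves unspecified
    for t in range(n):
        h = np.tanh(W_h @ h + W_a @ prev_col)        # f_trans: plain tanh RNN cell
        theta = 1.0 / (1.0 + np.exp(-(W_out @ h)))   # f_out: Bernoulli parameters Theta_t
        A[:, t] = rng.random(n) < theta              # edge-independent sampling of column t
        prev_col = A[:, t].astype(float)
    return A

rng = np.random.default_rng(0)
n, d_h = 5, 8
params = (0.5 * rng.normal(size=(d_h, d_h)),   # W_h
          0.5 * rng.normal(size=(d_h, n)),     # W_a
          0.5 * rng.normal(size=(n, d_h)))     # W_out
A = sample_graphrnn_simple(n, d_h, params, rng)
print(A)
```

The sketch samples all $n$ entries of each column independently and does not enforce symmetry or remove self-loops; a practical implementation for undirected graphs would only model the entries connecting a new node to earlier ones.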

Full Version of GraphRNN. To further improve the model, You et al (2018b) propose a full version of GraphRNN. The idea is to build a hierarchical RNN so that the conditional distribution in Eq. (11.34) becomes more expressive. Specifically, instead of using an edge-independent Bernoulli distribution, we leverage another autoregressive construction to model the dependencies among entries within one column of the adjacency matrix as below,

$$p(A_t \mid A_{<t}) = \prod_{i=1}^{n} p(A_{i,t} \mid A_{<i,t}, A_{<t}) \tag{11.38}$$

$$p(A_{i,t} \mid A_{<i,t}, A_{<t}) = \text{sigmoid}\left(g_{\text{out}}(\tilde{h}_{i,t})\right) \tag{11.39}$$

$$\tilde{h}_{i,t} = g_{\text{trans}}(\tilde{h}_{i-1,t}, A_{<i,t}) \tag{11.40}$$

$$\tilde{h}_{0,t} = h_t \tag{11.41}$$

$$h_t = f_{\text{trans}}(h_{t-1}, A_{t-1}). \tag{11.42}$$

Here the bottom RNN cell function $f_{\text{trans}}$ still recurrently updates the hidden state to get $h_t$, thus implementing the conditioning on all previous $t-1$ columns of the adjacency matrix $A$. To generate individual entries of the $t$-th column, the top RNN cell function $g_{\text{trans}}$ takes its own hidden state $\tilde{h}_{i-1,t}$ and the already generated entries $A_{<i,t}$ of the $t$-th column as input and updates the hidden state as $\tilde{h}_{i,t}$. The output distribution is a Bernoulli parameterized by the output of an MLP $g_{\text{out}}$ which takes $\tilde{h}_{i,t}$ as input. Note that the initial hidden state $\tilde{h}_{0,t}$ of the top RNN is set to the hidden state $h_t$ returned by the bottom RNN.

Objective. To train the GraphRNN, we can again resort to maximum log-likelihood as in Section 11.3.3.1. We also need to deal with permutations of nodes that leave the graph unchanged. Instead of randomly sampling a few orderings like (Li et al, 2018d), You et al (2018b) propose to use a random breadth-first-search ordering. The idea is to first randomly sample a node ordering and then pick the first node in this ordering as the root. A breadth-first-search (BFS) algorithm is applied starting from this root node to generate the final node ordering. Let us denote the corresponding permutation matrix as $P_{\text{BFS}}$. The final objective is,

$$\max \log \left( \sum_{P_{\text{BFS}}} p(P_{\text{BFS}} A P_{\text{BFS}}^\top) \right), \tag{11.43}$$

which is again a strict lower bound of the true log-likelihood. Empirical results in (You et al, 2018b) suggest that this random-BFS ordering provides good performance on a few benchmarks.
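
The random-BFS ordering described above can be sketched as follows. The function name `random_bfs_ordering` is ours, and restarting the search to cover disconnected graphs is our own assumption, since the text only describes a single root.

```python
from collections import deque
import numpy as np

def random_bfs_ordering(A, rng):
    """Random permutation first, then BFS from its first node; returns a node ordering."""
    n = A.shape[0]
    perm = rng.permutation(n)
    A_perm = A[np.ix_(perm, perm)]
    visited = np.zeros(n, dtype=bool)
    order = []
    for root in range(n):                 # restart so that disconnected graphs are covered
        if visited[root]:
            continue
        visited[root] = True
        queue = deque([root])
        while queue:
            u = queue.popleft()
            order.append(u)
            for v in np.flatnonzero(A_perm[u]):
                if not visited[v]:
                    visited[v] = True
                    queue.append(v)
    return perm[np.array(order)]          # map back to the original node labels

rng = np.random.default_rng(0)
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]])              # path 0 - 1 - 2 - 3
order = random_bfs_ordering(A, rng)
A_bfs = A[np.ix_(order, order)]           # adjacency matrix under the BFS ordering
print(order)
```

For a connected graph, every node after the root has at least one neighbor that appears earlier in the BFS ordering, which keeps connected nodes relatively close in the generation sequence.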

Discussion. The design of the GraphRNN is simple yet effective. The implementation is straightforward since most of the modules are standard. The simple variant is more efficient than the previous GNN-based model (Li et al, 2018d) since it generates multiple edges (corresponding to one column of the adjacency matrix) per step. Moreover, the simple variant performs comparably with the full version in the experiments. Nevertheless, GraphRNN still has certain limitations. For example, an RNN depends heavily on the node ordering since different node orderings would result in very different hidden states. The sequential ordering could make two nearby (even neighboring) nodes far away in the generation sequence (i.e., far away in the generation time step). Typically, hidden states of an RNN that are far apart in generation time tend to be quite different, thus making it hard for the model to learn that these nearby nodes should be connected. We call this phenomenon the sequential ordering bias.


11.3.3.3 Graph Recurrent Attention Networks (GRAN) 


Following this line of work (Li et al, 2018d; You et al, 2018b), Liao et al (2019a) propose the graph recurrent attention networks (GRAN). It is a GNN-based autoregressive model that greatly improves on the previous GNN-based model (Li et al, 2018d) in terms of capacity and efficiency. Furthermore, it alleviates the sequential ordering bias of GraphRNN (You et al, 2018b). In the following, we introduce the details of the model.

Model. We start with the adjacency matrix representation of graphs, i.e., $\mathcal{G} = \{P A P^\top \mid P \in \Pi_{\mathcal{G}}\}$. GRAN aims at directly building a probabilistic model over the adjacency matrix similarly to GraphRNN. Again, node/edge features are not of primary interest but can be incorporated without much modification to the model. In particular, from the perspective of modeling the adjacency matrix, the GNN-based autoregressive model in (Li et al, 2018d) generates one entry of the adjacency matrix at a step, whereas GraphRNN (You et al, 2018b) generates one column of entries at a step. GRAN takes a step further along this line by generating a block of columns/rows² of the adjacency matrix at a step, which greatly improves the generation speed. Denoting the submatrix with the first $k$ rows of the adjacency matrix $A$ as $A_{1:k,:}$, we have the following autoregressive decomposition of the probability,

$$p(A) = \prod_{t=1}^{\lceil n/k \rceil} p(A_{(t-1)k:tk,:} \mid A_{:(t-1)k,:}), \tag{11.44}$$

where $A_{:(t-1)k,:}$ indicates the adjacency matrix that has been generated before the $t$-th step (i.e., $t-1$ blocks with block size $k$). We use $A_{(t-1)k:tk,:}$ to denote the to-be-generated block at the $t$-th time step. Note that this part is a straightforward generalization of the autoregressive model of GraphRNNs in Eq. (11.34).

² Since we are mainly interested in simple graphs, i.e., unweighted, undirected graphs containing no self-loops or multiple edges, modeling columns or rows makes no difference. We adopt the row-wise notations to follow the original paper.

Fig. 11.3: The overview of the graph recurrent attention networks (GRAN). At each step, given an already generated graph, we add a new block of nodes (block size is 2 and color indicates the membership of individual group in the visualization) and augmented edges (dashed lines). Then we apply GRAN to this graph to obtain the output distribution over augmented edges (we show an edge-independent Bernoulli where the line width indicates the probability of generating individual augmented edges). Finally, we sample from the output distribution to obtain a new graph. Adapted from Figure 1 of (Liao et al, 2019a).
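
The block-wise decomposition in Eq. (11.44) amounts to a simple partition of the rows. The sketch below only illustrates the indexing (0-based, half-open slices in code versus 1-based indices in the text); `row_blocks` is a hypothetical helper and no model is involved.

```python
import math
import numpy as np

def row_blocks(n, k):
    """Yield (t, start, stop) so that rows start:stop form the t-th block, t = 1..ceil(n/k)."""
    for t in range(1, math.ceil(n / k) + 1):
        yield t, (t - 1) * k, min(t * k, n)

n, k = 7, 2
A = np.zeros((n, n), dtype=int)
blocks = list(row_blocks(n, k))
for t, start, stop in blocks:
    generated = A[:start, :]        # rows produced in earlier steps
    new_block = A[start:stop, :]    # rows to be generated at step t
print(len(blocks))
```

With block size $k = 1$ the loop visits one row per step, which matches the one-row (or one-column) granularity of GraphRNN; larger $k$ reduces the number of generation steps to $\lceil n/k \rceil$.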

To build the conditional probability $p(A_{(t-1)k:tk,:} \mid A_{:(t-1)k,:})$, GRAN leverages a
message passing graph neural network. Specifically, denoting the already generated
graph before step $t$ (corresponding to $A_{:(t-1)k,:}$) as $\mathcal{G}^{t-1} = (\mathcal{V}^{t-1}, \mathcal{E}^{t-1})$, we
first initialize every node representation vector with its corresponding row of the
adjacency matrix, i.e., $h_v = A_{v,:}$ for all $v \le (t-1)k$. Since we assume the maximum
number of nodes is $n$ and pad graphs of smaller size with dummy nodes, $h_v$ is of
size $n$. At time step $t$, we are interested in generating a new block of nodes
(corresponding to $A_{(t-1)k:tk,:}$) and their associated edges. For the $k$ new nodes in the
$t$-th block, since their corresponding rows in the adjacency matrix are initially all
zeros, we give them an arbitrary ordering from 1 to $k$ and use the one-hot encoding
of the order index, denoted $x_v$, as an additional representation to distinguish them.
We first form a new graph $\tilde{\mathcal{G}}^{t} = (\mathcal{V}^{t}, \tilde{\mathcal{E}}^{t})$ by connecting the $k$ new nodes to each other
(excluding self-loops) and to every node in $\mathcal{G}^{t-1}$. We call these edges the augmented
edges; they are shown as the dashed edges in Figure 11.3. In other words,
$\mathcal{V}^{t}$ is the union of $\mathcal{V}^{t-1}$ and the $k$ new nodes, whereas $\tilde{\mathcal{E}}^{t}$ is the union of $\mathcal{E}^{t-1}$ and the
augmented edges. The core part of GRAN is to construct a probability distribution
over the augmented edges, from which we can sample a new graph $\mathcal{G}^{t}$. Note that $\mathcal{G}^{t}$
has the same set of nodes as $\tilde{\mathcal{G}}^{t}$ but potentially fewer edges. To construct the
probability, we use a GNN with the following one-step message passing process,


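$$ m_{ij} = f_{\mathrm{msg}}(h_i - h_j), \quad \forall (i,j) \in \tilde{\mathcal{E}}^{t} \tag{11.45} $$
$$ \tilde{h}_i = [h_i \,\|\, x_i], \quad \forall i \in \mathcal{V}^{t} \tag{11.46} $$
$$ a_{ij} = \mathrm{sigmoid}\left(g_{\mathrm{att}}(\tilde{h}_i - \tilde{h}_j)\right), \quad \forall (i,j) \in \tilde{\mathcal{E}}^{t} \tag{11.47} $$
$$ h_i' = \mathrm{GRU}\Big(h_i, \sum_{j \in \Omega_i} a_{ij} m_{ij}\Big), \quad \forall i \in \mathcal{V}^{t} \tag{11.48} $$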


where $m_{ij}$ is again the message over edge $(i,j)$ and $\Omega_i$ is the set of neighboring
nodes of node $i$. The message function $f_{\mathrm{msg}}$ and the attention head $g_{\mathrm{att}}$ could
be MLPs. Note that we set $x_u$ to zeros for any node $u$ in the already generated
graph $\mathcal{G}^{t-1}$, since the one-hot encoding is only used to distinguish the newly
added nodes. $[a \,\|\, b]$ denotes the concatenation of two vectors $a$ and $b$. The updated
node representation $h_i'$ serves as the input to the next message passing step. We
typically unroll this message passing for a fixed number of steps, which is set as a
hyperparameter. Note that the message passing step is independent of the generation
step. The attention weights $a_{ij}$ depend on the one-hot encoding $x_i$, so that messages
on augmented edges can be weighted differently from those on edges belonging
to $\mathcal{E}^{t-1}$. Based on the final node representations returned by the message
passing, we construct the output distribution as follows,


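$$ p\left(A_{(t-1)k:tk,:} \mid A_{:(t-1)k,:}\right) = \sum_{c=1}^{C} \alpha_c \prod_{i=(t-1)k+1}^{tk} \prod_{j=1}^{n} \theta_{c,i,j} \tag{11.49} $$
$$ \alpha = \mathrm{softmax}\Big( \sum_{i=(t-1)k+1}^{tk} \sum_{j=1}^{n} \mathrm{MLP}_{\alpha}(h_i^{R} - h_j^{R}) \Big) \tag{11.50} $$
$$ \theta_{c,i,j} = \mathrm{sigmoid}\left( \mathrm{MLP}_{\theta}(h_i^{R} - h_j^{R}) \right). \tag{11.51} $$


Here $h_i^{R}$ denotes the final representation of node $i$ returned by the message
passing. We use a mixture of Bernoulli distributions where the mixture coefficients
are $\alpha = \{\alpha_1, \cdots, \alpha_C\}$ and the parameters are $\{\theta_{c,i,j}\}$. Compared to the
edge-independent Bernoulli distribution used in the simple variant of GraphRNN, this
output distribution can capture dependencies among multiple generated edges.
Furthermore, it is more efficient to sample from than the autoregressive distribution
used in the full version of GraphRNN.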
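
To see why a mixture of Bernoullis couples the edges of a block, the following minimal sketch samples from such a distribution. The function name `sample_block_edges` and the nested-list layout of `theta` are assumptions made for illustration; they are not taken from the GRAN implementation.

```python
import random

def sample_block_edges(alpha, theta, rng):
    """Sample one block of edges from a mixture of Bernoullis.

    alpha: list of C mixture weights that sum to one.
    theta: theta[c][i][j] is the edge probability under component c.
    Returns (c, edges) with edges[i][j] in {0, 1}.
    """
    # One component is chosen for the whole block ...
    c = rng.choices(range(len(alpha)), weights=alpha, k=1)[0]
    # ... and, given that component, edges are independent Bernoulli draws.
    edges = [[1 if rng.random() < p else 0 for p in row] for row in theta[c]]
    return c, edges

rng = random.Random(0)
alpha = [0.5, 0.5]
theta = [
    [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]],  # component 0: every edge present
    [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]],  # component 1: every edge absent
]
c, edges = sample_block_edges(alpha, theta, rng)
```

In this toy setting each edge is present with marginal probability 0.5, yet all edges in a sampled block always agree, which an edge-independent Bernoulli cannot express. Sampling still needs only one categorical draw plus one parallel pass over the edges.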

Objective To train the model, we also need to deal with permutations in order 
to maximize the log likelihood. Similar to the strategy used in (Li et al, 2018d; 
You et al, 2018b), Liao et al (2019a) propose to use a set of canonical orderings, 
i.e., breadth-first-search (BFS), depth-first-search (DFS), node-degree-descending, 
node-degree-ascending, and the k-core ordering. In particular, the BFS and the DFS 
ordering start from the node with the largest node degree. The k-core graph decom- 
position has been shown to be very useful for modeling cohesive groups in social 
networks (Seidman, 1983). The k-core of a graph $\mathcal{G}$ is a maximal subgraph in
which every node has degree k or more. Cores are nested, i.e., the i-core belongs to
the j-core if i > j, but they are not necessarily connected subgraphs. Most importantly,
the core decomposition, i.e., all cores ranked based on their orders, can be found in
linear time (w.r.t. the number of edges) (Batagelj and Zaversnik, 2003). Based on the
largest core number per node, we can uniquely determine a partition of all nodes,
i.e., disjoint sets of nodes which share the same largest core number. We then assign
each disjoint set the largest core number of its nodes. Starting
from the set with the largest core number, we rank all nodes within the set in
descending order of node degree. Then we move to the second largest core and so on to
obtain the final ordering of all nodes. We call this core-descending ordering the k-core
node ordering.
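
The k-core node ordering can be sketched as follows. This is a minimal illustration under stated assumptions: the core numbers are computed with a simple quadratic-time peeling loop rather than the linear-time algorithm of Batagelj and Zaversnik, node degree is taken in the original graph, and ties are broken by node id, a detail the text does not specify.

```python
def core_numbers(adj):
    """Largest core number of every node.

    adj maps each node to the set of its neighbours (simple undirected graph).
    Peeling: repeatedly remove a node of minimum remaining degree.
    """
    degree = {v: len(nbrs) for v, nbrs in adj.items()}
    remaining = set(adj)
    core = {}
    current = 0
    while remaining:
        v = min(remaining, key=lambda u: (degree[u], u))
        current = max(current, degree[v])
        core[v] = current
        remaining.remove(v)
        for u in adj[v]:
            if u in remaining:
                degree[u] -= 1
    return core

def k_core_ordering(adj):
    """Largest core first; within a core, higher degree first; ties by node id."""
    core = core_numbers(adj)
    return sorted(adj, key=lambda v: (-core[v], -len(adj[v]), v))

# A triangle (0, 1, 2) with a pendant node 3 attached to node 0.
adj = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1}, 3: {0}}
order = k_core_ordering(adj)
```

Here the triangle forms the 2-core and the pendant node only belongs to the 1-core, so the triangle nodes come first and node 0, which has the highest degree, leads the ordering.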
Our final training objective is, 


$$ \max \; \log \Big( \sum_{P \in \Pi_{\mathcal{Q}}} p\left(P A P^{\top}\right) \Big), \tag{11.52} $$


where $\Pi_{\mathcal{Q}}$ is the set of permutation matrices corresponding to the above node
orderings. This is again a strict lower bound of the true log likelihood.
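
The objective in Eq. (11.52) can be sketched as a log-sum-exp over a handful of orderings. In the sketch below, `permute_adjacency`, `ordering_objective`, and in particular the toy `log_prob` are hypothetical stand-ins; a real model would supply the autoregressive log-likelihood of the permuted adjacency matrix.

```python
import math

def permute_adjacency(A, order):
    """Return P A P^T, where row/column r of the result is node order[r] of A."""
    return [[A[u][v] for v in order] for u in order]

def log_sum_exp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(x - m) for x in values))

def ordering_objective(A, orderings, log_prob):
    """log sum_P p(P A P^T) over the given node orderings."""
    return log_sum_exp([log_prob(permute_adjacency(A, o)) for o in orderings])

# Path graph 0 - 1 - 2 and two candidate orderings.
A = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
orderings = [[0, 1, 2], [0, 2, 1]]

# Toy, order-dependent stand-in for a trained model's log-likelihood.
def log_prob(M):
    return math.log(0.1) if M[0][1] else math.log(0.2)

objective = ordering_objective(A, orderings, log_prob)
```

Because only a subset of all $n!$ permutations enters the sum, the value is a lower bound on the true log likelihood, and it is never smaller than the log-probability of the best single ordering.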

Discussion GRAN improves the previous GNN-based autoregressive model (Li 
et al, 2018d) and GraphRNN (You et al, 2018b) in the following ways. First, it
generates a block of rows of the adjacency matrix per step, which is more efficient than
generating a single entry or a single row per step. Second, GRAN
uses a GNN to construct the conditional probability. This helps alleviate the
sequential ordering bias in GraphRNN since a GNN is permutation equivariant, i.e.,
the node ordering would not affect the conditional probability per step. Third, the 
output distribution in GRAN is more expressive and more efficient for sampling. 
GRAN outperforms previous deep graph generative models in terms of empirical 
performances and the sizes of graphs that can be generated (e.g., GRAN can gener- 
ate graphs up to 5K nodes). Nevertheless, GRAN still suffers from the fact that the 
overall model depends on the particular choices of node orderings. It may be hard 
to find good orderings in certain applications. How to build an order-invariant deep 
graph generative model would be an interesting open question. 


11.3.4 Generative Adversarial Methods 


In this part, we review a few methods (De Cao and Kipf, 2018; Bojchevski et al, 
2018; You et al, 2018a) that apply the idea of generative adversarial networks (GAN) 
(Goodfellow et al, 2014b) in the context of graph generation. Based on how a graph 
is represented during training, we roughly divide them into two categories:
adjacency matrix based and random walk based methods. In the following, we explain
these two types of methods in detail. 
11.3.4.1 Adjacency Matrix Based GAN


MolGAN (De Cao and Kipf, 2018) and graph convolutional policy network (GCPN)
(You et al, 2018a) propose similar GAN-based frameworks to generate molecule
graphs that satisfy certain chemical properties. Here the graph data is represented
slightly differently from previous sections since one needs to specify both node types
(i.e., atom types) and edge types (i.e., chemical bond types). We denote the
adjacency matrix³ as $A \in \mathbb{R}^{N \times N \times Y}$, where $Y$ is the number of chemical bond types.
Basically, one slice along the third dimension of $A$ gives an adjacency matrix that
characterizes the connectivity among atoms under a specific chemical bond type.
We denote the node types as $X \in \mathbb{R}^{N \times T}$, where $T$ is the number of atom types. The
goal is to generate $(A, X)$ so that it is similar to observed molecule graphs and
possesses certain desirable properties.
Fig. 11.4: Overview of MolGAN. We first draw a latent variable $Z \sim p(Z)$
and feed it to a generator which produces a probabilistic (continuous) adjacency
matrix $\tilde{A}$ and a probabilistic (continuous) node type matrix $\tilde{X}$. Then we draw a
discrete adjacency matrix $A \sim \tilde{A}$ and a discrete node type matrix $X \sim \tilde{X}$, which
together specify a molecule graph. During training, we simultaneously feed the
generated graph to a discriminator and a reward network to obtain the adversarial loss
(measuring how similar the generated and the observed graphs are) and the negative
reward (measuring how likely the generated graphs are to satisfy certain chemical
constraints). Adapted from Figure 2 of (De Cao and Kipf, 2018).


Model We now explain the details of MolGAN and then highlight the difference
between GCPN and MolGAN. Similar to regular GANs, MolGAN consists of a
generator $G_{\theta}(Z)$ and a discriminator $D_{\phi}(A, X)$. To ensure the generated samples
satisfy desirable chemical properties, MolGAN adopts an additional reward network
$R_{\psi}(A, X)$. The overall pipeline of MolGAN is illustrated in Figure 11.4.

3 Note that A is actually a tensor. We slightly abuse the terminology here to ease the exposition.

To generate a molecule graph, we first sample a latent variable $Z \in \mathbb{R}^{D}$ from some
prior, e.g., $Z \sim \mathcal{N}(0, I)$. Then we use an MLP to directly map the sampled $Z$ to a
continuous adjacency matrix $\tilde{A}$ and a continuous node type matrix $\tilde{X}$. The continuous
version of the graph data has a natural probabilistic interpretation, i.e., $\tilde{A}_{i,j,c}$
means the probability of connecting atom $i$ and atom $j$ using the chemical
bond type $c$, whereas $\tilde{X}_{i,t}$ means the probability of assigning the $t$-th atom type to
the $i$-th atom. One can sample discrete graph data $(A, X)$ from the continuous
version, i.e., $A \sim \tilde{A}$ and $X \sim \tilde{X}$. This sampling procedure can be implemented using the
Gumbel softmax (Jang et al, 2017; Maddison et al, 2017). The discrete adjacency
matrix $A$ along with the discrete node type matrix $X$ specify a molecule graph and
complete the generation process.
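
As an illustration of the sampling step, here is a minimal sketch of the Gumbel softmax applied to one categorical distribution, such as one row of the continuous node type matrix. The helper names `gumbel_softmax` and `harden`, the temperature `tau`, and the clipping constant are assumptions for this sketch; applying the same routine to bond types further assumes that the probabilities over bond types for an atom pair sum to one (for example, with an explicit "no bond" type).

```python
import math
import random

def gumbel_softmax(probs, tau, rng):
    """Relaxed one-hot sample from a categorical distribution `probs`."""
    noisy = []
    for p in probs:
        # Gumbel(0, 1) noise is -log(-log(U)) with U ~ Uniform(0, 1).
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        g = -math.log(-math.log(u))
        noisy.append((math.log(max(p, 1e-12)) + g) / tau)
    # Softmax with the usual max-shift for numerical stability.
    m = max(noisy)
    exps = [math.exp(x - m) for x in noisy]
    total = sum(exps)
    return [e / total for e in exps]

def harden(soft):
    """Discretize a relaxed sample into a one-hot vector via argmax."""
    k = max(range(len(soft)), key=soft.__getitem__)
    return [1 if i == k else 0 for i in range(len(soft))]

rng = random.Random(0)
atom_type_probs = [0.7, 0.2, 0.1]   # one row of the continuous node type matrix
soft = gumbel_softmax(atom_type_probs, tau=0.5, rng=rng)
hard = harden(soft)                 # discrete atom type for this node
```

Lower temperatures push the relaxed sample closer to one-hot. In a differentiable framework the relaxed sample is what carries gradients back to the generator, while the hardened vector defines the discrete graph.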

To evaluate how similar the generated graphs and the observed graphs are, we 
need to build a discriminator. Since we are dealing with graphs, the natural can- 
didate for a discriminator is a graph neural network, e.g., a graph convolutional 
network (GCN) (Kipf and Welling, 2017b). In particular, we use a variant of GCN 
(Schlichtkrull et al, 2018) to incorporate multiple edge types. One such graph con- 
volutional layer is shown as below, 


$$ h_i' = \tanh\Big( f_s(h_i, x_i) + \sum_{j=1}^{N} \sum_{y=1}^{Y} \frac{A_{i,j,y}}{|\Omega_i|} f_y(h_j, x_i) \Big), \tag{11.53} $$


where $h_i$ and $h_i'$ are the input and the output node representations of the graph
convolutional layer, $\Omega_i$ is the set of neighboring nodes of node $i$, and $x_i$ is the $i$-th
row of $X$, i.e., the node type vector of node $i$. $f_s$ and $f_y$ are linear transformation
functions that are to be learned. After stacking this type of graph convolution for
multiple layers, we can read out the graph representation using the following
attention-weighted aggregation,


$$ h_{\mathcal{G}} = \tanh\Big( \sum_{v \in \mathcal{V}} \mathrm{sigmoid}\left(\mathrm{MLP}_{a}(h_v, x_v)\right) \odot \tanh\left(\mathrm{MLP}_{b}(h_v, x_v)\right) \Big), \tag{11.54} $$


where $h_v$ is the node representation returned by the top graph convolutional layer.
Note that $\mathrm{MLP}_{a}$ and $\mathrm{MLP}_{b}$ are two different instances of MLPs. $\odot$ means
element-wise product. We can use the graph representation vector $h_{\mathcal{G}}$ to compute the
discriminator score $D_{\phi}(A, X)$, i.e., the probability of classifying a graph as positive
(i.e., coming from the data distribution).
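
The attention-weighted readout of Eq. (11.54) can be sketched with plain lists. The weight matrices `W_a` and `W_b` below are single linear layers standing in for the two MLPs, and all numeric values are made up for illustration.

```python
import math

def linear(W, z):
    """Matrix-vector product; W is a list of rows."""
    return [sum(w * zi for w, zi in zip(row, z)) for row in W]

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def readout(H, X, W_a, W_b):
    """Gated sum over nodes followed by tanh, as in Eq. (11.54)."""
    out_dim = len(W_a)
    total = [0.0] * out_dim
    for h_v, x_v in zip(H, X):
        z = list(h_v) + list(x_v)                  # concatenate state and node type
        gate = [sigmoid(t) for t in linear(W_a, z)]
        value = [math.tanh(t) for t in linear(W_b, z)]
        for d in range(out_dim):
            total[d] += gate[d] * value[d]         # element-wise product, summed over nodes
    return [math.tanh(t) for t in total]

H = [[0.5, -1.0], [1.0, 0.0], [-0.3, 0.8]]         # node representations from the top layer
X = [[1, 0], [0, 1], [1, 0]]                       # one-hot node types
W_a = [[0.1, -0.2, 0.3, 0.0], [0.0, 0.5, -0.1, 0.2]]
W_b = [[0.4, 0.1, 0.0, -0.3], [-0.2, 0.2, 0.1, 0.1]]

h_G = readout(H, X, W_a, W_b)
h_G_reversed = readout(H[::-1], X[::-1], W_a, W_b)  # same graph, nodes listed in reverse
```

Because the nodes are combined by a sum, listing them in a different order leaves the graph representation unchanged up to floating-point rounding, which is the sense in which a GNN-based discriminator is order invariant.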

Objective Originally, GANs learn the model by performing the minimax opti- 
mization as below, 


$$ \min_{\theta} \max_{\phi} \; \mathbb{E}_{A,X \sim p_{\mathrm{data}}(A,X)} \left[\log D_{\phi}(A, X)\right] + \mathbb{E}_{Z \sim p(Z)} \left[\log\left(1 - D_{\phi}(G_{\theta}(Z))\right)\right], \tag{11.55} $$


where the generator aims at fooling the discriminator and the discriminator aims
at correctly classifying the generated samples and the observed samples. To
address certain issues in training GANs such as mode collapse and instability,
Wasserstein GAN (WGAN) (Arjovsky et al, 2017) and its improved version
(Gulrajani et al, 2017) have been proposed. MolGAN follows the improved WGAN and
uses the following objective to train the discriminator $D_{\phi}(A, X)$,
$$ \min_{\phi} \; \sum_{i=1}^{B} -D_{\phi}(A^{(i)}, X^{(i)}) + D_{\phi}(G_{\theta}(Z^{(i)})) + \alpha \left( \left\| \nabla_{\hat{A}^{(i)}, \hat{X}^{(i)}} D_{\phi}(\hat{A}^{(i)}, \hat{X}^{(i)}) \right\| - 1 \right)^{2}, \tag{11.56} $$


which is written here as a loss to be minimized. $B$ is the mini-batch size, $Z^{(i)}$ is the
$i$-th sample drawn from the prior, $(A^{(i)}, X^{(i)})$ is the $i$-th graph drawn from the data
distribution, and $(\hat{A}^{(i)}, \hat{X}^{(i)})$ is a random linear combination of the observed and the
generated graph, i.e., $(\hat{A}^{(i)}, \hat{X}^{(i)}) = \varepsilon (A^{(i)}, X^{(i)}) + (1 - \varepsilon) G_{\theta}(Z^{(i)})$, $\varepsilon \sim \mathcal{U}(0, 1)$. The
squared term on the right-hand side penalizes deviations of the discriminator's
gradient norm from one, so that the training becomes more stable. $\alpha$ is a weighting
term to balance the regularization and the objective. Moreover, fixing the
discriminator, we train the generator $G_{\theta}(Z)$ by adding the additional
constraint-dependent reward,

$$ \min_{\theta} \; \sum_{i=1}^{B} \lambda \left( -D_{\phi}(G_{\theta}(Z^{(i)})) \right) + (1 - \lambda) L_{RL}(G_{\theta}(Z^{(i)})), \tag{11.57} $$


where $L_{RL}$ is the negative reward returned by the reward network $R_{\psi}$ and $\lambda$ is the
weighting hyperparameter to regulate the trade-off between the two losses. The reward
could be some non-differentiable quantity that characterizes the chemical properties
of the generated molecules, e.g., how likely the generated molecule is to be
soluble in water. To learn the model with the non-differentiable reward, the deep
deterministic policy gradient (DDPG) (Lillicrap et al, 2015) is used. The architecture
of the reward network is the same as the discriminator, i.e., a GCN. It is pre-trained
by minimizing the squared error between the reward predicted by $R_{\psi}$ and the
property score per molecule produced by an external software. The pre-training is
necessary since the external software is typically slow and could significantly delay
the training if it were included in the whole training framework.
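
The two losses can be sketched with a toy critic whose input gradient is available in closed form. Everything below is an assumption made for illustration: graphs are treated as flattened vectors, the critic is $D(x) = \tanh(\langle w, x \rangle)$ rather than a GCN, both losses follow the improved-WGAN convention of being minimized, and the negative rewards are supplied as plain numbers instead of coming from a reward network.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def critic(w, x):
    """Toy critic D(x) = tanh(<w, x>) on a flattened graph encoding x."""
    return math.tanh(dot(w, x))

def critic_grad_norm(w, x):
    """Closed-form norm of the gradient of D with respect to x."""
    return (1.0 - math.tanh(dot(w, x)) ** 2) * math.sqrt(dot(w, w))

def critic_loss(w, real, fake, alpha, rng):
    """Sum over the batch of -D(real) + D(fake) + alpha * (||grad D(x_hat)|| - 1)^2."""
    total = 0.0
    for x_real, x_fake in zip(real, fake):
        eps = rng.random()  # epsilon ~ U(0, 1)
        x_hat = [eps * a + (1.0 - eps) * b for a, b in zip(x_real, x_fake)]
        penalty = (critic_grad_norm(w, x_hat) - 1.0) ** 2
        total += -critic(w, x_real) + critic(w, x_fake) + alpha * penalty
    return total

def generator_loss(w, fake, neg_rewards, lam):
    """Sum over the batch of lam * (-D(fake)) + (1 - lam) * L_RL(fake)."""
    return sum(lam * (-critic(w, x)) + (1.0 - lam) * r
               for x, r in zip(fake, neg_rewards))

rng = random.Random(0)
real = [[1.0, 0.0], [0.0, 1.0]]
fake = [[0.0, 0.0], [1.0, 1.0]]
w = [0.0, 0.0]                      # an untrained critic that outputs zero everywhere
d_loss = critic_loss(w, real, fake, alpha=10.0, rng=rng)
g_loss = generator_loss(w, fake, neg_rewards=[-0.5, -1.0], lam=0.25)
```

With the all-zero critic the gradient norm is zero, so the penalty alone contributes $\alpha \cdot 1$ per sample, which shows how the regularizer pushes the critic toward unit gradient norm. A neural critic would obtain the gradient by automatic differentiation instead of a closed form.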

Discussion MolGAN demonstrates strong empirical performance on a large
chemical database called QM9 (Ramakrishnan et al, 2014). Similar to other GANs,
the model is likelihood-free and can thus enjoy more flexible and powerful
generators. More importantly, although the generator still depends on the node ordering,
the discriminator and the reward networks are order (permutation) invariant since
they are built from GNNs. Interestingly enough, graph convolutional policy
network (GCPN) (You et al, 2018a) solves the same problem using a similar approach.
GCPN has a similar GAN-type objective and some additional domain-specific
rewards that capture the chemical properties of the molecules. It also learns both a
generator and a discriminator. However, it does not use a reward network to speed
up the reward computation. To deal with the learning of the non-differentiable reward,
GCPN leverages the proximal policy optimization (PPO) (Schulman et al, 2017)
method, which empirically performs better than the vanilla policy gradient method.
Another important difference is that GCPN generates the adjacency matrix in an
entry-by-entry autoregressive fashion, so that the dependencies among multiple
generated edges are captured, whereas MolGAN generates all entries of the adjacency
matrix in parallel conditioned on the latent variable. GCPN also achieves
impressive empirical results on another large chemical database called ZINC (Irwin et al,
2012). Nevertheless, there are still limitations with the above models. The discrete
gradient estimators (e.g., the policy gradient type of methods) could have large
variances, which may slow down the training. Since the domain-specific rewards are
non-differentiable and may be time-consuming to obtain, learning a neural network
based approximate reward function, as MolGAN does, is appealing. However,
as reported in MolGAN, pre-training seems to be crucial to make the whole
training successful. More exploration along the line of learning a reward function
would be beneficial to simplify the whole training pipeline. On the other hand, both
methods use some variant of GCNs as the discriminator, which is shown to be
insufficient in distinguishing certain graphs⁴ (Xu et al, 2019d). Therefore, exploring
more powerful discriminators like the Lanczos network (Liao et al, 2019b), which
exploits the spectrum of the graph Laplacian as the input feature, would be promising
to further improve the performance of the above methods.


11.3.4.2 Random Walk Based GAN 


In contrast to previous methods, NetGAN (Bojchevski et al, 2018) resorts to the 
random walk based representations of graphs. The key idea is to map a graph to a 
set of random walks and learn a generator and a discriminator in the space of ran- 
dom walks. The generator should generate random walks that are similar to those 
sampled from the observed graphs, whereas the discriminator should correctly dis- 
tinguish whether a random walk comes from the data distribution or the implicit 
distribution corresponding to the generator. 

Model We start by sampling a set of random walks of fixed length $T$ from
the given graph $\mathcal{G}$ using the biased second-order random walk sampling strategy
described in (Grover and Leskovec, 2016). We denote a random walk as a sequence
$(v_1, \cdots, v_T)$, where $v_i$ represents one node in $\mathcal{G}$. Note that a random walk may contain
duplicate nodes since it can revisit a node multiple times during the sampling.
We again assume the maximum number of nodes for any graph is $N$. For any node
$v_i$, we use its one-hot encoding vector as its node feature. In other words, we can
view a random walk as a sequence of nodes along with their features. Therefore,
similar to language models, it is natural to use an RNN as the generator of such
random walks. NetGAN uses an LSTM as the generator, whose initial
hidden state $h_0$ and memory $c_0$ are computed by feeding a randomly sampled
latent vector (drawn from $\mathcal{N}(0, I)$) to two separate MLPs. At each step, the LSTM
generator predicts a categorical distribution over all possible nodes and then samples
a node. The one-hot encoding of the node index is treated as the node representation
and fed to the LSTM generator as the input for the next step. We unroll this LSTM for
$T$ steps to obtain the final length-$T$ random walk. For the discriminator, we can
use another LSTM, which takes a random walk as input and predicts the probability
that it is sampled from the data distribution. The model is trained
with the same objective as the improved WGAN (Gulrajani et al, 2017).


4 For example, a GCN can not distinguish two triangles versus a six node circle (both have the 
same number of nodes and every node has exactly two neighbors) assuming all individual node 
features are identical. 
Fig. 11.5: Overview of NetGAN. We first draw a random vector from a fixed
prior $\mathcal{N}(0, I)$ and initialize the memory $c_0$ and the hidden state $h_0$ of an LSTM.
Then the LSTM generator generates which node to visit per step and is unrolled for 
a fixed number of steps T. The one-hot-encoding of node index is fed to the LSTM 
as the input for the next step. The discriminator is another LSTM which performs 
a binary classification to predict if a given random walk is sampled from a data 
distribution. Adapted from Figure 2 of (Bojchevski et al, 2018). 


After training the LSTM generator, we are capable of generating random walks.
However, we need an additional step to construct a graph from a set of generated
random walks. The strategy used by NetGAN is as follows. First, we count the edges
that appear in the set of random walks to obtain a score matrix $S$, which has the
same size as the adjacency matrix. The $(i, j)$-th entry $S_{i,j}$ of the score matrix
indicates how many times edge $(i, j)$ appears in the set of generated random walks.
Second, for each node $i$, we sample a neighbor $j$ according to the probability
$S_{i,j} / \sum_{v} S_{i,v}$. We repeat the sampling until node $i$ has at least one neighbor
connected, and skip the edge if it has already been generated. Finally, for any edge
$(i, j)$, we perform sampling without replacement according to the probability
$S_{i,j} / \sum_{u,v} S_{u,v}$ until the maximum number of edges is reached.
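
The graph assembly procedure can be sketched as below. This is a simplified illustration and not NetGAN's code: the function name `assemble_graph` is hypothetical, the score matrix is symmetrized by counting each transition in both directions, and a node that already has an edge is simply skipped in the second stage.

```python
import random

def assemble_graph(walks, num_nodes, num_edges, rng):
    """Build an undirected edge set from generated random walks."""
    # 1. Score matrix: count transitions, symmetrized for an undirected graph.
    S = [[0] * num_nodes for _ in range(num_nodes)]
    for walk in walks:
        for u, v in zip(walk, walk[1:]):
            if u != v:
                S[u][v] += 1
                S[v][u] += 1
    edges = set()
    # 2. Give every visited node at least one neighbour, drawn with
    #    probability S[i][j] / sum_v S[i][v].
    for i in range(num_nodes):
        if sum(S[i]) == 0 or any(i in e for e in edges):
            continue
        j = rng.choices(range(num_nodes), weights=S[i], k=1)[0]
        edges.add((min(i, j), max(i, j)))
    # 3. Sample further edges without replacement, proportionally to their
    #    scores, until the target number of edges is reached.
    candidates = [(i, j) for i in range(num_nodes) for j in range(i + 1, num_nodes)
                  if S[i][j] > 0 and (i, j) not in edges]
    while len(edges) < num_edges and candidates:
        weights = [S[i][j] for i, j in candidates]
        pick = rng.choices(range(len(candidates)), weights=weights, k=1)[0]
        edges.add(candidates.pop(pick))
    return edges

rng = random.Random(0)
walks = [[0, 1, 2], [2, 3, 0], [1, 2, 3]]
edges = assemble_graph(walks, num_nodes=4, num_edges=3, rng=rng)
```

Only node pairs that occur as consecutive steps in some walk can become edges, so the quality of the assembled graph depends directly on how many walks are generated.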

Discussion The random walk based representations for graphs are novel in the 
context of deep graph generative models. Moreover, they could be more scalable 
than the adjacency matrix representation since we are not bound by the quadratic 
(w.r.t. the number of nodes) complexity. The core modules of the NetGAN are 
LSTMs, which are efficient in handling sequences and easy to implement.
Nevertheless, the graph construction from a set of generated random walks seems to be
ad-hoc. There is no theoretical guarantee on how accurate the proposed construc- 
tion method is. It may require a large number of sampled random walks in order to 
generate a graph with good qualities. 
11.4 Summary


In this chapter, we review a few classic graph generative models and some modern 
ones which are constructed based on deep neural networks. From the perspectives 
of the model capacity and the empirical performances, e.g., how well the model can
fit observed data, deep graph generative models significantly outperform their clas- 
sic counterparts. For example, they could generate molecule graphs which are both 
chemically valid and similar to observed ones in terms of certain graph statistics. 

Although we have already made impressive progress in recent years, deep gen- 
erative models are still in the early stage. Moving forward, there are at least two 
main challenges. First, how can we scale these models so that they can handle real- 
world graphs like large scale social networks and WWW? It requires not only more 
computational resources but also more algorithmic improvements. For example, 
building a hierarchical graph generative model would be one promising direction 
to boost efficiency and scale. Second, how can we effectively add domain-specific 
constraints and/or conditioning on some input information? This question is impor- 
tant since many real-world applications require the graph generation to be condi- 
tioned on some inputs (e.g., scene graph generations conditioned on input images). 
Many graphs in practice come with certain constraints (e.g., chemical validity in the 
molecule generation). 


Editor’s Notes: Deep learning-based graph generation can be considered as 
a downstream task of graph representation learning, where the learned rep- 
resentations are usually enforced to follow some probabilistic assumptions. 
Hence the techniques in this topic widely enjoy the relevant properties and 
theories introduced in the previous chapters, such as scalability (Chapter 
6), expressive power (Chapter 5), and robustness (Chapter 8). Graph
generation also further motivates its downstream tasks in various interest- 
ing, important, yet usually challenging areas such as drug discovery (see 
Chapter 24), protein analysis (see Chapter 25), and program synthesis (see 
Chapter 22). 
Abstract Many problems regarding structured predictions are encountered in the
process of “transforming” a graph in the source domain into another graph in the target
domain, which requires learning a transformation mapping from the source to the target
domains. For example, it is important to study how structural connectivity influences 
functional connectivity in brain networks and traffic networks. It is also common to 
study how a protein (e.g., a network of atoms) folds, from its primary structure 
to tertiary structure. In this chapter, we focus on the transformation problem that 
involves graphs in the domain of deep graph neural networks. First, the problem 
of graph transformation in the domain of graph neural networks is formalized in
Section 12.1. Considering the entities that are being transformed during the trans- 
formation process, the graph transformation problem is further divided into four 
categories, namely node-level transformation, edge-level transformation, node-edge 
co-transformation, as well as other graph-involved transformations (e.g., sequence- 
to-graph transformation and context-to-graph transformation), which are discussed 
in Section 12.2 to Section 12.5, respectively. In each subsection, the definition of 
each category and their unique challenges are provided. Then, several representa- 
tive graph transformation models that address the challenges from different aspects 
for each category are introduced. 
12.1 Problem Formulation of Graph Transformation


Many problems regarding structured predictions are encountered in the process of 
“translating” an input data (e.g., images, texts) into a corresponding output data, 
which is to learn a translation mapping from the input domain to the target do- 
main. For example, many problems in computer vision can be seen as a “transla- 
tion” from an input image into a corresponding output image. Similar applications 
can also be found in language translation, where sentences (sequences of words) 
in one language are translated into corresponding sentences in another language. 
Such a generic translation problem, which is important yet has been extremely dif- 
ficult in nature, has attracted rapidly increasing attention in recent years. The con- 
ventional data transformation problem typically considers the data under special 
topology. For example, an image is a type of grid where each pixel is a node and 
each node has connections to its spatial neighbors. Texts are typically considered 
as sequences where each node is a word and an edge exists between two contextual 
words. Both grids and sequences are special types of graphs. In many practical ap- 
plications, it is required to work on data with more flexible structures than grids and 
sequences, and hence more powerful translation techniques are required in order 
to handle more generic graph-structured data. Thus, there emerges a new problem 
named deep graph transformation, the goal of which is to learn the mapping from 
the graph in the input domain to the graph in the target domain. The mathematical 
problem formulation of the graph is provided in detail as below. 

A graph is defined as $\mathcal{G}(\mathcal{V}, \mathcal{E}, F, E)$, where $\mathcal{V}$ is the set of $N$ nodes and
$\mathcal{E} \subseteq \mathcal{V} \times \mathcal{V}$ is the set of $M$ edges. $e_{i,j} \in \mathcal{E}$ is an edge connecting nodes
$v_i, v_j \in \mathcal{V}$. A graph can be described in matrix or tensor form using its (weighted)
adjacency matrix $A$. If the graph has node attributes and edge attributes, there are a
node attribute matrix $F \in \mathbb{R}^{N \times L}$, where $L$ is the dimension of node attributes, and
an edge attribute tensor $E \in \mathbb{R}^{N \times N \times K}$, where $K$ is the dimension of edge attributes.
Based on this definition of a graph, we denote the input graphs from the source
domain as $\mathcal{G}_S$ and the output graphs from the target domain as $\mathcal{G}_T$, so that graph
transformation is the mapping $\mathcal{G}_S \rightarrow \mathcal{G}_T$ (Guo et al, 2019c).
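
The graph definition above can be made concrete as a small container with shape checks. The class name `AttributedGraph` and the nested-list layout are assumptions for illustration; an actual implementation would more likely store dense or sparse tensors.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class AttributedGraph:
    A: List[List[float]]        # N x N (weighted) adjacency matrix
    F: List[List[float]]        # N x L node attribute matrix
    E: List[List[List[float]]]  # N x N x K edge attribute tensor

    def __post_init__(self):
        n = len(self.A)
        if any(len(row) != n for row in self.A):
            raise ValueError("A must be N x N")
        if len(self.F) != n or len({len(row) for row in self.F}) > 1:
            raise ValueError("F must be N x L")
        if len(self.E) != n or any(len(row) != n for row in self.E):
            raise ValueError("E must be N x N x K")
        if len({len(e) for row in self.E for e in row}) > 1:
            raise ValueError("all edge attribute vectors must have length K")

    @property
    def dims(self):
        """Return (N, L, K)."""
        return len(self.A), len(self.F[0]), len(self.E[0][0])

# Two nodes joined by one edge, L = 3 node attributes, K = 1 edge attribute.
g = AttributedGraph(
    A=[[0, 1], [1, 0]],
    F=[[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]],
    E=[[[0.0], [1.5]], [[1.5], [0.0]]],
)
```

A transformation model then maps one such object from the source domain to another in the target domain, and the categories discussed next differ in which of the fields are allowed to change.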

Considering the entities that are being transformed during the transformation 
process, the graph transformation problem is further divided into three categories, 
namely (1) node-level transformation, where only nodes and nodes attributes can 
change during translation process; (2) edge-level transformation, where only topol- 
ogy or edge attributes can change during translation process; (3) node-edge co- 
transformation where both nodes and edges can change during translation process. 
There are also some other transformations involving graphs, including sequence-to- 
graph transformation, graph-to-sequence transformation and context-to-graph trans- 
formation. Although they can be absorbed into the above three types if regarding 
sequences as a special case of graphs, we want to separate them out because they 
may usually attract different research communities. 
12.2 Node-level Transformation


12.2.1 Definition of Node-level Transformation 


Node-level transformation aims to generate or predict the node attributes or node 
category of the target graph conditioning on the input graph. It can also be regarded 
as a node prediction problem with stochasticity. It requires the node set $\mathcal{V}$ or node
attributes $F$ to change while the graph edge set and edge attributes are fixed during
the transformation, namely $\mathcal{G}_S(\mathcal{V}_S, \mathcal{E}, F_S, E) \rightarrow \mathcal{G}_T(\mathcal{V}_T, \mathcal{E}, F_T, E)$. Node
transformation has a wide range of real-world applications, such as predicting future states
of a system in the physical domain based on the fixed relations (e.g. gravitational 
forces) (Battaglia et al, 2016) among nodes and the traffic speed forecasting on the 
road networks (Yu et al, 2018a; Li et al, 2018e). Existing works adopt different 
frameworks to model the transformation process. 

Generally speaking, the straightforward way in dealing with the node translation 
problem is to regard it as the node prediction problem and utilize the conventional 
GNNs as encoder to learn the node embedding. Then, based on the node embed- 
ding, we can predict the node attributes of the target graphs. While solving the 
node transformation problem in specific domains, there come various unique re- 
quirements, such as considering the spatial and temporal patterns in the traffic flow 
prediction task. Thus, in this section, we focus on introducing three typical node 
transformation models in dealing with problems in different areas. 


12.2.2 Interaction Networks 


Battaglia et al (2016) proposed the interaction network in the task of reasoning about 
objects, relations, and physics, which is central to human intelligence, and a key 
goal of artificial intelligence. Many physical problems, such as predicting what will 
happen next in physical environments or inferring underlying properties of complex 
scenes, are challenging because their elements are composed and can influence each 
other as a whole system. It is impossible to solve such problems by considering each 
object and relation separately. Thus, the node transformation problem can help deal 
with this task via modeling the interactions and dynamics of elements in a complex 
system. To deal with the node transformation problem formalized in this scenario,
an interaction network (IN) is proposed, which combines three powerful
approaches: structured models, simulation, and deep learning. Structured models,
built on GNNs, are the main component and exploit the knowledge
of relations among objects. Simulation is an effective method for approximating
dynamical systems, predicting how the elements in a complex system are
influenced by interactions with one another and by the dynamics of the system.

The overall complex system can be represented as an attributed, directed
multigraph $\mathcal{G}$, where each node represents an object and each edge represents the
relationship between two objects, e.g., a fixed object attached by a spring to a freely
moving mass. To predict the dynamics of a single node (i.e., object), there is an
object-centric function, $h_i^{t+1} = f_O(h_i^{t})$, which takes the state $h_i^{t}$ of object $v_i$ at
time $t$ as input and outputs its future state $h_i^{t+1}$ at the next time step. Assuming
two objects have one directed relationship, the first object $v_i$ influences the second
object $v_j$ via their interaction. The effect or influence of this interaction, $e_{i,j}^{t+1}$, is
predicted by a relation-centric function, $f_R$, with the object states as well as the
attributes of their relationship as inputs. The object updating process is then written as:

$$ e_{i,j}^{t+1} = f_R(h_i^{t}, h_j^{t}, r_j); \quad h_j^{t+1} = f_O(h_j^{t}, e_{i,j}^{t+1}), \tag{12.1} $$
where $r_j$ refers to the attributes of the relation through which node $v_j$ receives the
interaction effect.

It is worth noting that the above operations are for an attributed, directed
multigraph because the edges (relations) can have attributes, and there can be multiple
distinct relations between two objects (e.g., rigid and magnetic interactions). In
summary, at each step, the interaction effect generated by each relationship is
calculated, and then an aggregation function is used to summarize all the interaction
effects on the relevant objects and update the state of each object.

An IN applies the same $f_R$ and $f_O$ to every relation and every target node,
respectively, which makes its relational and object reasoning able to handle variable
numbers of arbitrarily ordered objects and relations (i.e., graphs of variable size).
But one additional constraint must be satisfied to maintain this: the aggregation
function must be commutative and associative over the objects and relations. For
example, summation satisfies this, but division does not.
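
One update step of an interaction network can be sketched as follows. The scalar states and the particular `f_R` and `f_O` below (a spring-like pull and a small Euler-style update) are toy stand-ins chosen for illustration; in an IN both functions are learned neural networks.

```python
def interaction_step(states, relations, f_R, f_O):
    """Apply one interaction-network update.

    states: dict mapping node -> state (a float here, for simplicity).
    relations: list of (sender, receiver, attribute) triples; repeated
    sender/receiver pairs are allowed, as in a multigraph.
    """
    effects = {v: 0.0 for v in states}
    for i, j, r in relations:
        # Summation is commutative and associative, so the order in which
        # relations are processed does not matter.
        effects[j] += f_R(states[i], states[j], r)
    return {v: f_O(states[v], effects[v]) for v in states}

def f_R(h_i, h_j, r):
    """Toy relation function: a pull of strength r towards the sender."""
    return r * (h_i - h_j)

def f_O(h, e):
    """Toy object function: move the state a small step along the effect."""
    return h + 0.1 * e

states = {0: 0.0, 1: 1.0, 2: 3.0}
relations = [(0, 1, 1.0), (2, 1, 0.5), (1, 2, 2.0)]
new_states = interaction_step(states, relations, f_R, f_O)
shuffled = interaction_step(states, relations[::-1], f_R, f_O)
```

Node 1 receives two effects that cancel, node 2 is pulled towards node 1, and node 0 receives nothing. Processing the relations in reverse order gives the same result, which is the property the aggregation constraint is meant to guarantee.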

The IN can be cast in the framework of Message Passing Neural Networks (MPNNs), with a
message passing step, an aggregation step, and a node updating step. However, unlike
MPNN models, which focus on binary relations (i.e., there is one edge per pair of nodes),
the IN can also handle hypergraphs, where an edge can correspond to an n-th order
relation that combines n nodes (n > 2). The IN has shown a strong ability to learn
accurate physical simulations and to generalize to novel systems with different numbers
and configurations of objects and relations. It can also learn to infer abstract
properties of physical systems, such as potential energy. The IN implementation is the
first learnable physics engine that can scale up to real-world problems, and is a
promising template for new AI approaches to reasoning about other physical and mechanical
systems, scene understanding, social perception, hierarchical planning, and analogical
reasoning.


12.2.3 Spatio-Temporal Convolution Recurrent Neural Networks 


Spatio-temporal forecasting is a crucial task for a learning system that operates in
a dynamic environment. It has a wide range of applications, from autonomous vehicle
operations, to energy and smart grid optimization, to logistics and supply chain
management. Traffic forecasting on road networks, the core component of intelligent
transportation systems, can be formalized as a node transformation problem, the goal of
which is to predict the future traffic speeds (i.e., node attributes) of a sensor network
(i.e., graph) given historic traffic speeds (i.e., historical node attributes). This type
of node transformation is unique and challenging due to the complex spatio-temporal
dependencies in a series of graphs and the inherent difficulty of long-term forecasting.
To deal with this, the pair-wise spatial correlations between traffic sensors are
represented using a directed graph whose nodes are sensors and whose edge weights denote
the proximity between sensor pairs, measured by the road network distance. The dynamics
of the traffic flow are then modeled as a diffusion process, and the diffusion
convolution operation is utilized to capture the spatial dependency. The overall
Diffusion Convolutional Recurrent Neural Network (DCRNN) integrates diffusion
convolution, the sequence-to-sequence architecture and the scheduled sampling technique.

Denote the node information (e.g., traffic flow) observed on a graph $\mathcal{G}$ as a
graph signal $F$, and let $F^t$ represent the graph signal observed at time $t$. The
temporal node transformation problem aims to learn a mapping from $T'$ historical graph
signals to $T$ future graph signals:
$[F^{t-T'+1},\dots,F^{t};\mathcal{G}] \rightarrow [F^{t+1},\dots,F^{t+T};\mathcal{G}]$.
The spatial dependency is modeled by relating node information to a diffusion process,
which is characterized by a random walk on $\mathcal{G}$ with restart probability
$\alpha \in [0,1]$ and a state transition matrix $D_O^{-1}W$. Here
$D_O = \mathrm{diag}(W\mathbf{1})$ is the out-degree diagonal matrix, and $\mathbf{1}$
denotes the all-ones vector. After many time steps, such a Markov process converges to a
stationary distribution $\mathcal{P} \in \mathbb{R}^{N \times N}$ whose $i$-th row
represents the likelihood of diffusion from node $v_i$. Thus, a diffusion convolutional
layer can be defined as


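$H_{:,q} = f\left(\sum_{p=1}^{P} F_{:,p} \star_{\mathcal{G}} f_{\Theta_{q,p,:,:}}\right), \quad q \in \{1,\dots,Q\}$, (12.2)
where $H_{:,q}$ is the $q$-th output feature, $f$ is an activation function, and the
diffusion convolution operation is defined as
$F_{:,p} \star_{\mathcal{G}} f_{\theta} = \sum_{k=0}^{K-1} \left(\theta_{k,1}(D_O^{-1}W)^k + \theta_{k,2}(D_I^{-1}W^{\top})^k\right) F_{:,p}, \quad p \in \{1,\dots,P\}$. (12.3)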


Here $D_O$ and $D_I$ refer to the out-degree and in-degree diagonal matrices,
respectively, and $P$ and $Q$ refer to the feature dimensions of the input and output
node features at each diffusion convolution layer. The diffusion convolution is defined
on both directed and undirected graphs. When applied to undirected graphs, the existing
graph convolutional network (GCN) can be considered a special case of the diffusion
convolution network.
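
As a concrete reading of Equation 12.3, the following NumPy sketch applies a diffusion
convolution to a single input feature column. It is an illustrative sketch rather than
the DCRNN reference code: the function name `diffusion_conv`, the toy graph, and the
filter values in `theta` are assumptions, and every node is assumed to have nonzero in-
and out-degree so that both transition matrices are well defined.

```python
import numpy as np

def diffusion_conv(W, F_p, theta):
    """Eq. 12.3 for one feature column F_p (shape [N]); theta has shape [K, 2]."""
    d_out = W.sum(axis=1)               # out-degrees (row sums of W)
    d_in = W.sum(axis=0)                # in-degrees (column sums of W)
    P_fwd = W / d_out[:, None]          # D_O^{-1} W
    P_bwd = W.T / d_in[:, None]         # D_I^{-1} W^T
    out = np.zeros(len(F_p))
    x_fwd = F_p.astype(float)           # (D_O^{-1} W)^k F_p, starting at k = 0
    x_bwd = F_p.astype(float)           # (D_I^{-1} W^T)^k F_p, starting at k = 0
    for k in range(theta.shape[0]):
        out += theta[k, 0] * x_fwd + theta[k, 1] * x_bwd
        x_fwd = P_fwd @ x_fwd           # advance one forward diffusion step
        x_bwd = P_bwd @ x_bwd           # advance one backward diffusion step
    return out

# Toy directed cycle 0 -> 1 -> 2 -> 0 with unit edge weights.
W = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [1.0, 0.0, 0.0]])
F_p = np.array([1.0, 0.0, 0.0])
theta = np.array([[0.5, 0.5],           # k = 0: forward and backward weights
                  [1.0, 0.0]])          # k = 1: forward diffusion only
result = diffusion_conv(W, F_p, theta)
print(result)
```

Keeping separate forward and backward walks is what lets the filter distinguish upstream
from downstream influence on a directed graph; on an undirected graph the two transition
matrices coincide.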

To deal with the temporal dependency during the node transformation process,
recurrent neural networks (RNNs) or Gated Recurrent Units (GRUs) can be leveraged.
For example, by replacing the matrix multiplications in the GRU with the diffusion
convolution, the Diffusion Convolutional Gated Recurrent Unit (DCGRU) is defined as

$r^t = \sigma(\Theta_r \star_{\mathcal{G}} [F^t, H^{t-1}] + b_r)$ (12.4)
$u^t = \sigma(\Theta_u \star_{\mathcal{G}} [F^t, H^{t-1}] + b_u)$
$C^t = \tanh(\Theta_C \star_{\mathcal{G}} [F^t, (r^t \odot H^{t-1})] + b_c)$
$H^t = u^t \odot H^{t-1} + (1 - u^t) \odot C^t$,

where $F^t$ and $H^t$ denote the input and output of all the nodes at time $t$, $r^t$ and
$u^t$ are the reset gate and update gate at time $t$, respectively, and $b_r$, $b_u$ and
$b_c$ are bias terms. $\star_{\mathcal{G}}$ denotes the diffusion convolution defined in
Equation 12.3, and $\Theta_r$, $\Theta_u$, $\Theta_C$ are the parameters of the
corresponding filters in the diffusion network.
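
The gating structure of Equation 12.4 can be sketched as follows. This is a minimal
NumPy illustration under stated assumptions, not the published DCRNN code: `dconv` is a
multi-feature version of the diffusion convolution with a hypothetical parameter layout
`[K, 2, P, Q]`, the weights are random rather than trained, and the graph is a small
dense random one.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dconv(P_fwd, P_bwd, X, Theta, b):
    """Multi-feature diffusion convolution: X is [N, P], Theta is [K, 2, P, Q]."""
    out = np.zeros((X.shape[0], Theta.shape[3]))
    x_fwd, x_bwd = X, X
    for k in range(Theta.shape[0]):
        out += x_fwd @ Theta[k, 0] + x_bwd @ Theta[k, 1]
        x_fwd, x_bwd = P_fwd @ x_fwd, P_bwd @ x_bwd
    return out + b

def dcgru_cell(P_fwd, P_bwd, F_t, H_prev, params):
    """One DCGRU step: GRU gates with matrix products replaced by dconv."""
    x = np.concatenate([F_t, H_prev], axis=1)
    r = sigmoid(dconv(P_fwd, P_bwd, x, *params["r"]))        # reset gate
    u = sigmoid(dconv(P_fwd, P_bwd, x, *params["u"]))        # update gate
    x_c = np.concatenate([F_t, r * H_prev], axis=1)
    c = np.tanh(dconv(P_fwd, P_bwd, x_c, *params["c"]))      # candidate state
    return u * H_prev + (1.0 - u) * c

rng = np.random.default_rng(0)
N, n_in, n_hid, K = 4, 2, 3, 2
W = rng.random((N, N)) + 0.1                     # dense positive toy adjacency
P_fwd = W / W.sum(axis=1, keepdims=True)         # D_O^{-1} W
P_bwd = W.T / W.sum(axis=0)[:, None]             # D_I^{-1} W^T
params = {
    gate: (0.1 * rng.standard_normal((K, 2, n_in + n_hid, n_hid)), np.zeros(n_hid))
    for gate in "ruc"
}

H = np.zeros((N, n_hid))
for F_t in rng.standard_normal((5, N, n_in)):    # five time steps of input signals
    H = dcgru_cell(P_fwd, P_bwd, F_t, H, params)
print(H.shape)
```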

Another typical spatio-temporal graph convolution network for spatio-temporal node
transformation is proposed by Yu et al (2018a). This model comprises several
spatio-temporal convolutional blocks, which combine graph convolutional layers and
convolutional sequence learning layers to model spatial and temporal dependencies.
Specifically, the framework consists of two spatio-temporal convolutional blocks
(ST-Conv blocks) and a fully-connected output layer at the end. Each ST-Conv block
contains two temporal gated convolution layers and one spatial graph convolution layer
in the middle. The residual connection and bottleneck strategy are applied inside each
block. The input sequence of node information is uniformly processed by the ST-Conv
blocks to explore spatial and temporal dependencies coherently. Comprehensive features
are integrated by the output layer to generate the final prediction. In contrast to the
above-mentioned DCGRU, this model is built completely from convolutional structures and
captures both spatial and temporal patterns without any recurrent neural network.
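
The "temporal, spatial, temporal" sandwich of an ST-Conv block can be sketched as
below. This is a simplified illustration, not the model of Yu et al (2018a): the spatial
layer here is a first-order graph convolution with a normalized adjacency matrix (an
assumption made for brevity), the residual connections and normalization are omitted,
and all names, shapes and random weights are hypothetical.

```python
import numpy as np

def temporal_gated_conv(X, Wt):
    """X: [T, N, C_in]; Wt: [Kt, C_in, 2*C_out]. Returns [T-Kt+1, N, C_out]."""
    Kt = Wt.shape[0]
    T_out = X.shape[0] - Kt + 1
    # 1-D convolution along the time axis, written as a sum of shifted products.
    conv = sum(X[k:k + T_out] @ Wt[k] for k in range(Kt))
    p, q = np.split(conv, 2, axis=-1)
    return p * (1.0 / (1.0 + np.exp(-q)))        # gated linear unit

def spatial_graph_conv(X, A_hat, Theta):
    """X: [T, N, C]; A_hat: [N, N] normalized adjacency; Theta: [C, C_out]."""
    return np.maximum(A_hat @ X @ Theta, 0.0)

def st_conv_block(X, A_hat, W1, Theta, W2):
    """Temporal gated conv -> spatial graph conv (bottleneck) -> temporal gated conv."""
    return temporal_gated_conv(
        spatial_graph_conv(temporal_gated_conv(X, W1), A_hat, Theta), W2)

rng = np.random.default_rng(0)
N, T = 5, 12
A = np.roll(np.eye(N), 1, axis=1) + np.roll(np.eye(N), -1, axis=1)  # ring graph
A_tilde = A + np.eye(N)                          # add self-loops
d = A_tilde.sum(axis=1)
A_hat = A_tilde / np.sqrt(np.outer(d, d))        # symmetric normalization

X = rng.standard_normal((T, N, 1))               # e.g., 12 past speed readings
W1 = 0.1 * rng.standard_normal((3, 1, 16))       # 1 -> 8 channels
Theta = 0.1 * rng.standard_normal((8, 4))        # 8 -> 4 channels (bottleneck)
W2 = 0.1 * rng.standard_normal((3, 4, 16))       # 4 -> 8 channels
out = st_conv_block(X, A_hat, W1, Theta, W2)
print(out.shape)
```

Each temporal convolution with kernel width 3 shortens the sequence by two steps, which
is why the output covers fewer time steps than the input.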


12.3 Edge-level Transformation 


12.3.1 Definition of Edge-level Transformation 


Edge-level transformation aims to generate the graph topology and edge attributes of
the target graph conditioned on the input graph. It requires the edge set $\mathcal{E}$
and edge attributes $E$ to change while the graph node set and node attributes are fixed
during the transformation:
$\mathcal{T}: \mathcal{G}_S(\mathcal{V}, \mathcal{E}_S, F, E_S) \rightarrow \mathcal{G}_T(\mathcal{V}, \mathcal{E}_T, F, E_T)$.
Edge transformation has a wide range of real-world applications, such as modeling
chemical reactions (You et al, 2018a), protein folding (Anand and Huang, 2018) and
malware cyber-network synthesis (Guo et al, 2018b). For example, in social networks where
people are the nodes and their contacts are the edges, the contact graph among them
varies dramatically across different situations. When people are organizing a riot, for
instance, the contact graph is expected to become denser and several special “hubs”
(e.g., key players) may appear. Hence, accurately predicting the contact network in a
target situation is highly beneficial to situational awareness and resource allocation.

Numerous efforts have been devoted to edge-level graph transformation. Here we introduce
three typical methods for modelling the edge-level graph transformation problem: graph
transformation generative adversarial networks (GT-GAN), multi-scale graph
transformation networks (Misc-GAN), and graph transformation policy networks (GTPN).


12.3.2 Graph Transformation Generative Adversarial Networks 


Generative Adversarial Networks (GANs) are an alternative method for generation
problems. They are designed based on a game theory scenario called the min-max game,
where a discriminator and a generator compete against each other. The generator
generates data from stochastic noise, and the discriminator tries to tell whether a
sample is real (coming from a training set) or fabricated (from the generator). The
absolute difference between carefully calculated rewards from both networks is minimized
so that both networks learn simultaneously as they try to outperform each other.
GANs can be extended to a conditional model if both the generator and discriminator are
conditioned on some extra auxiliary information, such as class labels or data from other
modalities. Conditional GANs are realized by feeding the conditional information into
both the discriminator and the generator as an additional input layer. In this scenario,
when the conditional information is a graph, conditional GANs can be utilized to handle
the graph transformation problem by learning the mapping from the conditional graph
(i.e., input graph) to the target graph (i.e., output graph). Here, we introduce two
typical edge-level graph transformation techniques that are based on conditional GANs.

The Graph-Translation Generative Adversarial Network (GT-GAN) proposed by Guo et al
(2018b) learns the mapping from input graphs to target graphs. GT-GAN consists of a
graph translator $\mathcal{T}$ and a conditional graph discriminator $\mathcal{D}$. The
graph translator $\mathcal{T}$ is trained to produce target graphs that cannot be
distinguished from “real” ones by the conditional graph discriminator $\mathcal{D}$.
Specifically, the generated target graph
$\hat{\mathcal{G}}_T = \mathcal{T}(\mathcal{G}_S, U)$ should be indistinguishable from
the real one, $\mathcal{G}_T$, given the current input graph $\mathcal{G}_S$, where $U$
refers to random noise. $\mathcal{T}$ and $\mathcal{D}$ undergo an adversarial training
process based on input and target graphs by optimizing the following objective:


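$\mathcal{L}(\mathcal{T}, \mathcal{D}) = \mathbb{E}_{\mathcal{G}_S, \mathcal{G}_T \sim \mathcal{S}}[\log \mathcal{D}(\mathcal{G}_T | \mathcal{G}_S)]$ (12.5)
$\quad + \mathbb{E}_{\mathcal{G}_S \sim \mathcal{S}}[\log(1 - \mathcal{D}(\mathcal{T}(\mathcal{G}_S, U) | \mathcal{G}_S))]$,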


where $\mathcal{S}$ refers to the dataset. $\mathcal{T}$ tries to minimize this objective
while an adversarial $\mathcal{D}$ tries to maximize it, i.e.,
$\mathcal{T}^* = \arg\min_{\mathcal{T}} \max_{\mathcal{D}} \mathcal{L}(\mathcal{T}, \mathcal{D})$.
The graph translator includes two parts: a graph encoder and a graph decoder. A graph
convolutional neural net (Kawahara et al, 2017) is extended to serve as the graph encoder
in order to embed the input graph into node-level representations, while a new graph
deconvolution net is designed as the decoder to generate the target graph. Specifically,
the encoder consists of edge-to-edge and edge-to-node convolution layers, which first
extract latent edge-level representations and then node-level representations
$\{H_i\}_{i=1}^{N}$, where $H_i$ is the latent representation vector of node $v_i$. The
decoder consists of node-to-edge and edge-to-edge deconvolution layers, which first
compute each edge representation $\hat{E}_{i,j}$ based on $H_i$ and $H_j$, and then the
edge attribute tensor $E$ based on $\hat{E}$. Based on the graph deconvolution above,
skip connections can be used to link the extracted edge latent representations of each
layer in the graph encoder with those in the graph decoder.

Specifically, in the graph translator, the output of the $l$-th “edge deconvolution”
layer in the decoder is concatenated with the output of the $l$-th “edge convolution”
layer in the encoder to form two joint channels of feature maps, which are then
input into the $(l+1)$-th deconvolution layer. It is worth noting that one key factor
for effective translation is the design of a symmetrical encoder-decoder pair, where
the graph deconvolution mirrors the graph convolution in reverse. This allows the
skip connections to directly transfer the edge information of each level to the
corresponding decoder layer.

The graph discriminator is utilized to distinguish between the “translated” target
graph and the “real” one based on the input graph, as this helps to train the generator
in an adversarial way. Technically, this requires the discriminator to accept two graphs
simultaneously as inputs (a real target graph and an input graph, or a generated graph
and an input graph) and classify the two graphs as either related or not. Thus, a
conditional graph discriminator (CGD) that leverages the same graph convolution layers
as the encoder is utilized for the graph classification. Specifically, the input and
target graphs are both ingested by the CGD and stacked into a tensor, which can be
considered a 2-channel input. After obtaining the node representations, the graph-level
embedding is computed by summing these node embeddings. Finally, a softmax layer is
applied to classify whether the graph pair contains a real target graph or a generated
one.

To further handle the situation when the pairing information of the input and
the output is not available, Gao et al (2018b) propose Unpaired Graph Translation
Generative Adversarial Nets (UGT-GAN) based on Cycle-GAN (Zhu et al, 2017), which
incorporate the same encoder and decoder as GT-GAN to handle unpaired graph
transformation problems. The cycle consistency loss is generalized into a graph cycle
consistency loss for unpaired graph translation. Specifically, graph cycle consistency
adds an opposite-direction translator from the target to the source domain,
$\mathcal{T}_r: \mathcal{G}_T \rightarrow \mathcal{G}_S$, trains the mappings for both
directions simultaneously, and adds a cycle consistency loss that encourages
$\mathcal{T}_r(\mathcal{T}(\mathcal{G}_S)) \approx \mathcal{G}_S$ and
$\mathcal{T}(\mathcal{T}_r(\mathcal{G}_T)) \approx \mathcal{G}_T$. Combining this loss
with adversarial losses on the domains $\mathcal{G}_T$ and $\mathcal{G}_S$ yields the
full objective for unpaired graph translation.


12.3.3 Multi-scale Graph Transformation Networks


Many real-world networks exhibit a hierarchical distribution over graph communities.
For instance, given an author collaboration network, research groups of
well-established and closely collaborating researchers can be identified by existing
graph clustering methods at the finer level of granularity. At a coarser level, we may
find that these research groups constitute large-scale communities, which correspond to
various research topics or subjects. Thus, it is necessary to capture the hierarchical
community structures over the graphs for the edge-level graph transformation problem.
Here, we introduce a graph generation model for learning the distribution of graphs,
which is nevertheless formalized as an edge-level graph transformation problem.

Based on GANs, a multi-scale graph generative model, Misc-GAN, can be utilized to model
the underlying distribution of graph structures at different levels of granularity.
Inspired by the success of deep generative models in image translation, a
cycle-consistent adversarial network (CycleGAN) (Zhu et al, 2017) is adopted to learn
the graph structure distribution and then generate a synthetic coarse graph at each
granularity level. Thus, the graph generation task can be realized by “transferring”
the hierarchical distribution from the graphs in the source domain to a unique graph in
the target domain.

In this framework, the input graph is characterized as several coarse-grained graphs by
aggregating the strongly coupled nodes with a small algebraic distance to form coarser
nodes. Overall, the framework can be separated into three stages. First, the
coarse-grained graphs at $K$ levels of granularity are constructed from the input graph
adjacency matrix $A_S$. The adjacency matrix
$A_S^{(k)} \in \mathbb{R}^{N^{(k)} \times N^{(k)}}$ of the coarse-grained graph at the
$k$-th layer is defined as follows:

$A_S^{(k)} = P^{(k-1)\top} \cdots P^{(1)\top} A_S P^{(1)} \cdots P^{(k-1)}$, (12.6)

where $A_S^{(1)} = A_S$, $P^{(k)} \in \mathbb{R}^{N^{(k)} \times N^{(k+1)}}$ is a
coarse-graining operator for the $k$-th level, and $N^{(k)}$ refers to the number of
nodes of the coarse-grained graph at level $k$. In the next stage, each coarse-grained
graph at each level $k$ is reconstructed back into a fine graph adjacency matrix
$A_T^{(k)} \in \mathbb{R}^{N \times N}$ as follows:

$A_T^{(k)} = R^{(1)\top} \cdots R^{(k-1)\top} A_S^{(k)} R^{(k-1)} \cdots R^{(1)}$, (12.7)

where $R^{(k)} \in \mathbb{R}^{N^{(k+1)} \times N^{(k)}}$ is the reconstruction operator
for the $k$-th level. Thus all the reconstructed fine graphs at each layer are on the
same scale. Finally, these graphs are aggregated into a unique one by a linear function
to get the final adjacency matrix as follows:
$A_T = \sum_{k=1}^{K} \left(w^k A_T^{(k)} + b^k I\right)$, where $w^k \in \mathbb{R}$ and
$b^k \in \mathbb{R}$ are weights and biases.
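
The three stages (coarsening, reconstruction, aggregation) can be illustrated on a toy
graph. The sketch below makes several assumptions that are not part of Misc-GAN itself:
level 1 is taken to be the input graph, the coarse-graining operator is a hard cluster
assignment matrix built by the hypothetical helper `assignment_matrix`, the
reconstruction operator is simply its transpose, the CycleGAN step at each level is
skipped, and the weights `w` and biases `b` are fixed by hand rather than learned.

```python
import numpy as np

def assignment_matrix(labels, n_coarse):
    """Hard cluster assignment P in {0,1}^{N x n_coarse} (a simple assumed operator)."""
    P = np.zeros((len(labels), n_coarse))
    P[np.arange(len(labels)), labels] = 1.0
    return P

# Level-1 (input) adjacency: two triangles joined by a single edge.
A1 = np.zeros((6, 6))
for i, j in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]:
    A1[i, j] = A1[j, i] = 1.0

# Stage 1: coarsen 6 nodes into 2 coarse nodes (Eq. 12.6 with k = 2).
P1 = assignment_matrix([0, 0, 0, 1, 1, 1], 2)
A2 = P1.T @ A1 @ P1

# Stage 2: reconstruct an N x N matrix from the coarse graph (Eq. 12.7).
R1 = P1.T                          # assumed reconstruction operator
A2_fine = R1.T @ A2 @ R1

# Stage 3: linear aggregation of the per-level fine graphs.
w = np.array([0.7, 0.05])          # hand-picked weights (learned in practice)
b = np.array([0.0, 0.0])           # hand-picked biases (learned in practice)
A_T = w[0] * A1 + w[1] * A2_fine + (b[0] + b[1]) * np.eye(6)

print(A2)
print(A_T.shape)
```

In the coarse matrix, each diagonal entry counts the (doubly counted) edges inside a
community and each off-diagonal entry counts the edges between two communities, which is
the community-level structure the coarser scales are meant to capture.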




12.3.4 Graph Transformation Policy Networks 


Beyond the general framework for the edge-level transformation problem, it is necessary
to deal with some domain-specific problems, which may require incorporating domain
knowledge or information into the transformation process. For example, the chemical
reaction product prediction problem is a typical edge-level transformation problem,
where the input reactant and reagent molecules can be jointly represented as input
graphs, and the process of generating product molecules (i.e., output graphs) from
reactant molecules can be formulated as a set of edge-level graph transformations.
Formalizing the chemical reaction product prediction problem as an edge-level
transformation problem is beneficial for two reasons: (1) it can capture and utilize the
molecular graph structure patterns of the input reactants and reagents (i.e., atom pairs
with changing connectivity); and (2) it can automatically choose from these reactivity
patterns a correct set of reaction triples to generate the desired products.

Do et al (2019) proposed the Graph Transformation Policy Network (GTPN), a generic
method that combines the strengths of graph neural networks and reinforcement learning
to learn reactions directly from data with minimal chemical knowledge. GTPN generates
the output graph by formalizing the graph transformation process as a Markov decision
process and modifying the input source graph through several iterations. From the
chemical reaction perspective, the process of reaction product prediction can be
formulated as predicting a set of bond changes given the reactant and reagent molecules
as input. A bond change is characterized by the atom pair that holds the bond (where the
change is) and the new bond type (what the change is).

Mathematically, given a graph of reactant molecules as the input graph $\mathcal{G}_S$,
GTPN predicts a set of reaction triples that transforms $\mathcal{G}_S$ into a graph of
product molecules $\mathcal{G}_T$. This process is modeled as a sequence of tuples
$(\xi^t, v_i^t, v_j^t, b^t)$, where $v_i^t$ and $v_j^t$ are the nodes selected from the
node set at step $t$ whose connection needs to be modified, $b^t$ is the new edge type of
$(v_i^t, v_j^t)$, and $\xi^t$ is a binary signal that indicates the end of the sequence.
Generally, at every step of the forward pass, GTPN performs seven major steps: 1)
computing the atom representation vectors through a message passing neural network
(MPNN); 2) computing the $K$ most probable reaction atom pairs; 3) predicting the
continuation signal $\xi^t$; 4) predicting the reaction atom pair $(v_i^t, v_j^t)$; 5)
predicting a new bond type $b^t$ for this atom pair; 6) updating the atom
representations; and 7) updating the recurrent state.

Specifically, the above iterative process of edge-level transformation is formulated as
a Markov Decision Process (MDP) characterized by a tuple
$(\mathcal{S}, \mathcal{A}, f_P, f_R, \Gamma)$, where $\mathcal{S}$ is a set of states,
$\mathcal{A}$ is a set of actions, $f_P$ is a state transition function, $f_R$ is a
reward function, and $\Gamma$ is a discount factor. Thus, the overall model is optimized
via reinforcement learning. A state $s^t \in \mathcal{S}$ is the intermediate graph
generated at step $t$, and $s^0$ refers to the input graph. An action
$a^t \in \mathcal{A}$ performed at step $t$ is represented as a tuple
$(\xi^t, (v_i^t, v_j^t, b^t))$. The action is composed of three consecutive sub-actions:
predicting $\xi^t$, $(v_i^t, v_j^t)$ and $b^t$, respectively. In the state transition,
if $\xi^t = 1$, the current graph $\mathcal{G}^t$ is modified based on the reaction
triple $(v_i^t, v_j^t, b^t)$ to generate a new intermediate graph $\mathcal{G}^{t+1}$.
Regarding the reward, both immediate rewards and delayed rewards are utilized to
encourage the model to learn the optimal policy faster. At every step $t$, the model
receives a positive reward for each sub-action of $(\xi^t, (v_i^t, v_j^t, b^t))$ that it
predicts correctly, and a negative reward otherwise. After the prediction process has
terminated, if the generated products are exactly the same as the ground-truth products,
a positive delayed reward is also given; otherwise, a negative delayed reward is given.
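
The state transition and reward structure of this MDP can be sketched without any
learning machinery. The example below is a toy illustration, not GTPN itself: the
molecule is a plain dictionary from atom pairs to bond types, the "policy" is a fixed
list of actions, bond type 0 is assumed to mean "no bond", and the reward magnitudes
(`r_pos`, `r_neg`, and the delayed reward of 1) are hypothetical values chosen for
readability.

```python
def apply_action(bonds, action):
    """State transition: apply one reaction triple if the continuation signal is 1."""
    xi, (u, v, b) = action
    new_bonds = dict(bonds)
    if xi == 1:
        key = (min(u, v), max(u, v))
        if b == 0:
            new_bonds.pop(key, None)     # assumed convention: type 0 removes the bond
        else:
            new_bonds[key] = b
    return new_bonds

def step_reward(action, target_action, r_pos=1.0, r_neg=-1.0):
    """Immediate reward: one term per sub-action (signal, atom pair, bond type)."""
    xi, (u, v, b) = action
    xi_t, (u_t, v_t, b_t) = target_action
    checks = [xi == xi_t, {u, v} == {u_t, v_t}, b == b_t]
    return sum(r_pos if ok else r_neg for ok in checks)

# Input graph s^0 and ground-truth product, as {(atom_i, atom_j): bond_type}.
state = {(0, 1): 1, (1, 2): 2}
product = {(0, 1): 1, (1, 2): 1, (0, 3): 1}

# A fixed action sequence standing in for the policy's predictions.
predicted = [(1, (1, 2, 1)), (1, (0, 3, 1)), (0, (0, 0, 0))]
ground_truth = [(1, (2, 1, 1)), (1, (0, 3, 1)), (0, (0, 0, 0))]

total = 0.0
for action, target in zip(predicted, ground_truth):
    total += step_reward(action, target)          # immediate reward
    if action[0] == 0:                            # xi = 0 ends the sequence
        break
    state = apply_action(state, action)
total += 1.0 if state == product else -1.0        # delayed reward
print(state, total)
```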

Different from the encoder-decoder framework of GT-GAN, GTPN is a typical example of a
reinforcement learning-based graph transformation network, where the target graph is
generated by iteratively modifying the input graph. Reinforcement learning (RL) is a
commonly used framework in which a computer algorithm, the so-called agent, learns
control policies and generation processes by interacting with its environment. The
nature of reinforcement learning methods (i.e., a sequential generation process) makes
it a suitable framework for graph transformation problems, which sometimes require
step-by-step edits on the input graphs to generate the final target graphs.


12.4 Node-Edge Co-Transformation 


12.4.1 Definition of Node-Edge Co-Transformation 


Node-edge co-transformation (NECT) aims to generate the node and edge attributes of
the target graph conditioned on those of the input graph. It requires that both nodes
and edges can vary during the transformation process between the source graph and
the target graph as follows:
$\mathcal{G}_S(\mathcal{V}_S, \mathcal{E}_S, F_S, E_S) \rightarrow \mathcal{G}_T(\mathcal{V}_T, \mathcal{E}_T, F_T, E_T)$.
There are two categories of techniques used to assimilate the input graph to generate
the target graph: embedding-based and editing-based.

Embedding-based NECT usually uses an encoder to map the source graph into latent
representations that contain higher-level information on the input graph, which are
then decoded into the target graph by a decoder (Jin et al, 2020c, 2018c; Kaluza et al,
2018; Maziarka et al, 2020b; Sun and Li, 2019). These methods are usually based on
either conditional VAEs (Sohn et al, 2015) or conditional GANs (Mirza and Osindero,
2014). Three main techniques will be introduced in this section: the junction-tree
variational auto-encoder, molecule cycle-consistent adversarial networks and directed
acyclic graph transformation networks.


12.4.1.1 Junction-tree Variational Auto-encoder Transformer 


The goal of molecule optimization, which is one of the important molecule generation
problems, is to optimize the properties of a given molecule by transforming it into a
novel output molecule with optimized properties. The molecule optimization problem is
typically formalized as a NECT problem where the input graph refers to the initial
molecule and the output graph refers to the optimized molecule. Both the node and edge
attributes can change during the transformation process.

The Junction-tree Variational Auto-encoder (JT-VAE) is motivated by the key
challenge of molecule optimization in the domain of drug design, which is to find
target molecules with the desired chemical properties (Jin et al, 2018a). In terms of
the model architecture, JT-VAE extends the VAE (Kingma and Welling, 2014) to
molecular graphs by introducing a suitable encoder and a matching decoder. Under
JT-VAE, each molecule is interpreted as being assembled from subgraphs chosen
from a dictionary of valid components. These components serve as building blocks
when encoding a molecule into a vector representation and decoding latent vectors
back into optimized molecular graphs. The dictionary of components, such as rings,
bonds and individual atoms, is large enough to ensure that a given molecule can
be covered by overlapping clusters without forming cluster cycles. In general,
JT-VAE generates molecular graphs in two phases, by first generating a tree-structured
scaffold over chemical substructures and then combining them into a molecule with
a graph message-passing network.

The latent representation of the input graph $\mathcal{G}$ is encoded by a graph
message-passing network (Dai et al, 2016; Gilmer et al, 2017). Here, let $x_v$ denote
the feature vector of the vertex $v$, involving properties of the vertex such as the
atom type and valence. Similarly, each edge $(u,v) \in \mathcal{E}$ has a feature vector
$x_{uv}$ indicating its bond type. Two hidden vectors $\nu_{uv}$ and $\nu_{vu}$ denote
the message from $u$ to $v$ and vice versa. In the encoder, messages are exchanged via
loopy belief propagation:

$\nu_{uv}^{(t)} = \tau\left(W_1^g x_u + W_2^g x_{uv} + W_3^g \sum_{w \in N(u) \setminus v} \nu_{wu}^{(t-1)}\right)$, (12.8)

where $\nu_{uv}^{(t)}$ is the message computed in the $t$-th iteration, initialized with
$\nu_{uv}^{(0)} = 0$, $\tau(\cdot)$ is the ReLU function, $W_1^g$, $W_2^g$ and $W_3^g$
are weights, and $N(u)$ denotes the neighbors of $u$. Then, after $T$ iterations, the
latent vector of each vertex is generated, capturing its local graphical structure:

$h_u = \tau\left(U_1^g x_u + \sum_{v \in N(u)} U_2^g \nu_{vu}^{(T)}\right)$, (12.9)

where $U_1^g$ and $U_2^g$ are weights. The final graph representation is
$h_{\mathcal{G}} = \sum_i h_i / |\mathcal{V}|$, where $|\mathcal{V}|$ is the number of
nodes in the graph. The corresponding latent variable $z_{\mathcal{G}}$ can be sampled
from $\mathcal{N}(z_{\mathcal{G}}; \mu_{\mathcal{G}}, \sigma_{\mathcal{G}})$, where
$\mu_{\mathcal{G}}$ and $\sigma_{\mathcal{G}}$ are calculated from $h_{\mathcal{G}}$
via two separate affine layers.
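
Equations 12.8 and 12.9 can be traced on a tiny graph with the NumPy sketch below. It
is an illustration under stated assumptions rather than the JT-VAE implementation: the
function name `encode_graph`, the feature sizes, and the random (untrained) weights are
hypothetical, and messages are updated synchronously at every iteration.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def encode_graph(x, edges, x_e, W1, W2, W3, U1, U2, T=3):
    """x: [N, d] atom features; edges: undirected pairs; x_e: {(u, v): bond feature}."""
    nbrs = {u: set() for u in range(len(x))}
    for u, v in edges:
        nbrs[u].add(v)
        nbrs[v].add(u)
    directed = [(u, v) for u in nbrs for v in nbrs[u]]
    feat = lambda u, v: x_e[(u, v)] if (u, v) in x_e else x_e[(v, u)]
    m = W3.shape[0]
    nu = {e: np.zeros(m) for e in directed}                  # nu^(0) = 0
    for _ in range(T):
        nu = {                                               # Eq. 12.8
            (u, v): relu(W1 @ x[u] + W2 @ feat(u, v)
                         + W3 @ sum((nu[(w, u)] for w in nbrs[u] - {v}), np.zeros(m)))
            for (u, v) in directed
        }
    h = np.stack([                                           # Eq. 12.9
        relu(U1 @ x[u]
             + sum((U2 @ nu[(v, u)] for v in nbrs[u]), np.zeros(U1.shape[0])))
        for u in range(len(x))
    ])
    return h, h.mean(axis=0)                                 # node vectors and h_G

rng = np.random.default_rng(0)
d, d_e, m, m_out = 2, 2, 4, 3
x = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])           # e.g., C, C, O (one-hot)
edges = [(0, 1), (1, 2)]
x_e = {(0, 1): np.array([1.0, 0.0]), (1, 2): np.array([0.0, 1.0])}  # bond types
W1, W2, W3 = (rng.standard_normal(s) for s in [(m, d), (m, d_e), (m, m)])
U1, U2 = rng.standard_normal((m_out, d)), rng.standard_normal((m_out, m))

h, h_G = encode_graph(x, edges, x_e, W1, W2, W3, U1, U2)
print(h.shape, h_G.shape)
```

Excluding the receiver $v$ from the sum in Equation 12.8 prevents a message from being
echoed straight back along the edge it arrived on.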

A junction tree can be represented as $(\mathcal{V}, \mathcal{E}, \mathcal{X})$ whose
node set is $\mathcal{V} = (C_1, \dots, C_n)$ and edge set is
$\mathcal{E} = (E_1, \dots, E_m)$. This junction tree is labeled by the label dictionary
$\mathcal{X}$. Similar to the graph representation, each cluster $C_i$ is represented by
a one-hot vector $x_i$ and each edge $(C_i, C_j)$ corresponds to two message vectors
$v_{ij}$ and $v_{ji}$. An arbitrary leaf node is picked as the root and messages are
propagated in two phases:

$s_{ij} = \sum_{k \in N(i) \setminus j} v_{ki}$ (12.10)
$z_{ij} = \sigma(W^z x_i + U^z s_{ij} + b^z)$
$r_{ki} = \sigma(W^r x_i + U^r v_{ki} + b^r)$
$\tilde{v}_{ij} = \tanh\left(W x_i + U \sum_{k \in N(i) \setminus j} r_{ki} \odot v_{ki}\right)$
$v_{ij} = (1 - z_{ij}) \odot s_{ij} + z_{ij} \odot \tilde{v}_{ij}$.

$h_i$, the latent representation of node $v_i$, can now be calculated:

$h_i = \tau\left(W^o x_i + \sum_{k \in N(i)} U^o v_{ki}\right)$. (12.11)

The final tree representation is $h_{\mathcal{T}_{\mathcal{G}}} = h_{root}$.
$z_{\mathcal{T}_{\mathcal{G}}}$ is sampled in a similar way as in the graph encoding
process.

Under the JT-VAE framework, the junction tree is decoded from
$z_{\mathcal{T}_{\mathcal{G}}}$ using a tree-structured decoder that traverses the tree
from the root and generates nodes in depth-first order. During this process, a node
receives information from other nodes, and this information is propagated through
message vectors $h_{ij}$. Formally, let
$\tilde{\mathcal{E}} = \{(i_1, j_1), \dots, (i_m, j_m)\}$ be the set of edges traversed
over the junction tree $(\mathcal{V}, \mathcal{E})$, where $m = 2|\mathcal{E}|$ because
each edge is traversed in both directions. The model visits node $i_t$ at time $t$. Let
$\tilde{\mathcal{E}}_t$ be the first $t$ edges in $\tilde{\mathcal{E}}$. The message is
updated as
$h_{i_t, j_t} = GRU(x_{i_t}, \{h_{k, i_t}\}_{(k, i_t) \in \tilde{\mathcal{E}}_t, k \neq j_t})$,
where $x_{i_t}$ corresponds to the node features. The decoder first makes a prediction
regarding whether the node $i_t$ still has children to be generated, in which the
probability is calculated as:

$p_t = \sigma\left(u^d \cdot \left(W_1^d x_{i_t} + W_2^d z_{\mathcal{T}_{\mathcal{G}}} + W_3^d \sum_{(k, i_t) \in \tilde{\mathcal{E}}_t} h_{k, i_t}\right)\right)$, (12.12)

where $u^d$, $W_1^d$, $W_2^d$ and $W_3^d$ are weights. Then, when a child node $j$ is
generated from its parent $i$, its node label is predicted with:

$q_j = \mathrm{softmax}\left(U^l \cdot \left(W_1^l z_{\mathcal{T}_{\mathcal{G}}} + W_2^l h_{ij}\right)\right)$, (12.13)

where $U^l$, $W_1^l$ and $W_2^l$ are weights and $q_j$ is a distribution over the label
dictionary $\mathcal{X}$.
The final step of the model is to reproduce a molecular graph $\mathcal{G}$ that
represents the predicted junction tree $(\hat{\mathcal{V}}, \hat{\mathcal{E}})$ by
assembling the subgraphs together into the final molecular graph. Let
$\mathcal{G}(\mathcal{T}_{\mathcal{G}})$ be the set of graphs corresponding to the
junction tree $\mathcal{T}_{\mathcal{G}}$. Decoding the graph $\hat{\mathcal{G}}$ from
the junction tree
$\hat{\mathcal{T}}_{\mathcal{G}} = (\hat{\mathcal{V}}, \hat{\mathcal{E}})$ is a
structured prediction:

$\hat{\mathcal{G}} = \arg\max_{\mathcal{G}' \in \mathcal{G}(\hat{\mathcal{T}}_{\mathcal{G}})} f^a(\mathcal{G}')$, (12.14)

where $f^a$ is a scoring function over candidate graphs. The decoder starts by sampling
the assembly of the root and its neighbors according to their scores, then proceeds to
assemble the neighbors and associated clusters. In terms of scoring the realization of
each neighborhood, let $\mathcal{G}_i$ be the subgraph resulting from a particular
merging of cluster $C_i$ in the tree with its neighbors $C_j$,
$j \in N_{\hat{\mathcal{T}}_{\mathcal{G}}}(i)$. $\mathcal{G}_i$ is scored as a candidate
subgraph by first deriving a vector representation $h_{\mathcal{G}_i}$, and
$f^a(\mathcal{G}_i) = h_{\mathcal{G}_i} \cdot z_{\mathcal{G}}$ is the subgraph score.
For atoms in $\mathcal{G}_i$, let $\alpha_v = i$ if $v \in C_i$ and $\alpha_v = j$ if
$v \in C_j \setminus C_i$ to mark the position of atoms in the junction tree and
retrieve messages $\hat{m}_{i,j}$, summarizing the subtree under $i$ along the edge
$(i,j)$ obtained by the tree encoder. Then the neural messages can be obtained and
aggregated similarly to the encoding step with parameters:

$\mu_{uv}^{(t)} = \tau\left(W_1^a x_u + W_2^a x_{uv} + W_3^a \tilde{\mu}_{uv}^{(t-1)}\right)$, (12.15)
$\tilde{\mu}_{uv}^{(t-1)} = \sum_{w \in N(u) \setminus v} \mu_{wu}^{(t-1)}$ if $\alpha_u = \alpha_v$,
$\tilde{\mu}_{uv}^{(t-1)} = \hat{m}_{\alpha_u, \alpha_v} + \sum_{w \in N(u) \setminus v} \mu_{wu}^{(t-1)}$ if $\alpha_u \neq \alpha_v$,

where $W_1^a$, $W_2^a$ and $W_3^a$ are weights.


12.4.1.2 Molecule Cycle-Consistent Adversarial Networks 


Cycle-consistent adversarial networks, an alternative way to achieve embedding-based
NECT, were originally developed for image-to-image transformation. The aim there is to
learn to transform an image from a source domain to a target domain in the absence of
paired examples by using an adversarial loss. To promote the chemical compound design
process, this idea has been borrowed for graph transformation. For instance, Molecule
Cycle-Consistent Adversarial Networks (Mol-CycleGAN) have been proposed to generate
optimized compounds with high structural similarity to the originals (Maziarka et al,
2020b). Given a set of molecules $\mathcal{G}_Y$ with the desired molecular properties,
Mol-CycleGAN aims to train a model to perform the transformation
$G: \mathcal{G}_X \rightarrow \mathcal{G}_Y$ and then use this model to optimize
molecules. Here $\mathcal{G}_X$ is the set of molecules without the desired molecular
properties. In order to represent the sets $\mathcal{G}_X$ and $\mathcal{G}_Y$, this
model requires a reversible embedding that allows both the encoding and decoding of
molecules. To achieve this, JT-VAE is employed to provide the latent space during the
training process, in which the distance between molecules required to calculate the loss
function can be defined directly. Each molecule is represented as a point in the latent
space, assigned based on the mean of the variational encoding distribution.

For the implementation, the sets $\mathcal{G}_X$ and $\mathcal{G}_Y$ must be defined
(e.g., inactive/active molecules), after which the mapping functions
$G: \mathcal{G}_X \rightarrow \mathcal{G}_Y$ and
$F: \mathcal{G}_Y \rightarrow \mathcal{G}_X$ are introduced. The discriminators $D_X$
and $D_Y$ are introduced to force the generators $F$ and $G$ to generate samples from
distributions close to those of $\mathcal{G}_X$ and $\mathcal{G}_Y$, respectively. For
this process, $F$, $G$, $D_X$ and $D_Y$ are modeled by neural networks. This approach to
molecule optimization is designed to (1) take a prior molecule $x$ with no specified
features from the set $\mathcal{G}_X$ and compute its latent space embedding; (2) use
the generative neural network $G$ to obtain the embedding of a molecule $G(x)$ that has
this feature but is also similar to the original molecule $x$; and (3) decode the latent
space coordinates given by $G(x)$ to obtain the optimized molecule.

The loss function to train Mol-CycleGAN is:

$L(G, F, D_X, D_Y) = L_{GAN}(G, D_Y, \mathcal{G}_X, \mathcal{G}_Y) + L_{GAN}(F, D_X, \mathcal{G}_Y, \mathcal{G}_X)$ (12.16)
$\quad + \lambda_1 L_{cyc}(G, F) + \lambda_2 L_{identity}(G, F)$,

and $G^*, F^* = \arg\min_{G,F} \max_{D_X, D_Y} L(G, F, D_X, D_Y)$. The adversarial loss
is utilized:

$L_{GAN}(G, D_Y, \mathcal{G}_X, \mathcal{G}_Y) = \frac{1}{2} \mathbb{E}_{y \sim p_{data}^{\mathcal{G}_Y}}\left[(D_Y(y) - 1)^2\right]$ (12.17)
$\quad + \frac{1}{2} \mathbb{E}_{x \sim p_{data}^{\mathcal{G}_X}}\left[(D_Y(G(x)))^2\right]$,

which ensures that the generator $G$ (and $F$) generates samples from a distribution
close to the distribution of $\mathcal{G}_Y$ (or $\mathcal{G}_X$), denoted by
$p_{data}^{\mathcal{G}_Y}$ (or $p_{data}^{\mathcal{G}_X}$). The cycle consistency loss

$L_{cyc}(G, F) = \mathbb{E}_{y \sim p_{data}^{\mathcal{G}_Y}}\left[\|G(F(y)) - y\|_1\right]$ (12.18)
$\quad + \mathbb{E}_{x \sim p_{data}^{\mathcal{G}_X}}\left[\|F(G(x)) - x\|_1\right]$

reduces the space available to the possible mapping functions such that for a
molecule $x$ from the set $\mathcal{G}_X$, the GAN cycle constrains the output to a
molecule similar to $x$. The inclusion of the cyclic component acts as a regularization
factor, making the model more robust. Finally, to ensure that the generated molecule is
close to the original, the identity mapping loss is employed:

$L_{identity}(G, F) = \mathbb{E}_{y \sim p_{data}^{\mathcal{G}_Y}}\left[\|F(y) - y\|_1\right]$ (12.19)
$\quad + \mathbb{E}_{x \sim p_{data}^{\mathcal{G}_X}}\left[\|G(x) - x\|_1\right]$,

which further reduces the space available to the possible mapping functions and
prevents the model from generating molecules that lie far away from the starting
molecule in the latent space of JT-VAE.
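
To make the composition of the objective concrete, the sketch below evaluates Equations
12.16 to 12.19 on toy latent vectors. It is a hedged illustration, not the
Mol-CycleGAN training code: the generators are simple shifts in latent space, the
discriminators output a constant 0.5, the batches are random stand-ins for JT-VAE
embeddings, and the values of `lam1` and `lam2` are illustrative assumptions.

```python
import numpy as np

def lsgan_loss(D, G, x_batch, y_batch):
    """Eq. 12.17: least-squares adversarial loss for generator G and discriminator D."""
    return 0.5 * np.mean((D(y_batch) - 1.0) ** 2) + 0.5 * np.mean(D(G(x_batch)) ** 2)

def l1(a, b):
    """Batch mean of the L1 distance between rows of a and b."""
    return np.mean(np.sum(np.abs(a - b), axis=1))

def total_loss(G, F, D_X, D_Y, x, y, lam1=0.3, lam2=0.1):
    gan = lsgan_loss(D_Y, G, x, y) + lsgan_loss(D_X, F, y, x)
    cyc = l1(G(F(y)), y) + l1(F(G(x)), x)          # Eq. 12.18
    idt = l1(F(y), y) + l1(G(x), x)                # Eq. 12.19
    return gan + lam1 * cyc + lam2 * idt           # Eq. 12.16

# Toy stand-ins: latent vectors in R^2, shift generators, uninformative discriminators.
shift = np.array([0.1, -0.2])
G = lambda z: z + shift                            # maps G_X embeddings toward G_Y
F = lambda z: z - shift                            # exact inverse of G
D_X = lambda z: np.full(len(z), 0.5)
D_Y = lambda z: np.full(len(z), 0.5)

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 2))                    # embeddings without the property
y = rng.standard_normal((8, 2))                    # embeddings with the property

loss = total_loss(G, F, D_X, D_Y, x, y)
print(round(loss, 6))
```

Because `F` exactly inverts `G` here, the cycle term vanishes, while the identity term
grows with the size of the shift; this is the trade-off that keeps optimized molecules
close to their starting points.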


12.4.1.3 Directed Acyclic Graph Transformation Networks 


Another example of embedding-based NECT is a neural model for learning deep 
functions on the space of directed acyclic graphs (DAGs) (Kaluza et al, 2018). Math- 
ematically, the neural methodologies developed to handle graph-structured data can 
be regarded as function approximation frameworks where both the domain and the 
range of the target function can be graph spaces. In the area of interest here, the 
embedding and synthesis methodologies are gathered into a single unified frame- 
work such that functions can be learned from one graph space onto another graph 
space without the need to impose a strong assumption of independence between the 
embedding and generative process. Note that only functions in DAG space are
considered here. A general encoder-decoder framework for learning functions from one
DAG space onto another has been developed.

Here, an RNN is employed to model the function $F$, and the resulting model is denoted
as D2DRNN. Specifically, the model consists of an encoder $E_\alpha$ with model parameters
$\alpha$ that computes a fixed-size embedding of the input graph $\mathcal{G}_{in}$, and a decoder
$D_\beta$ with parameters $\beta$, using the embedding as input and producing the output graph
$\mathcal{G}_{out}$. Equivalently, the DAG-function can be defined as
$F(\mathcal{G}_{in}) := D_\beta(E_\alpha(\mathcal{G}_{in}))$.

The encoder is borrowed from the deep-gated DAG recursive neural network
(DG-DAGRNN) (Amizadeh et al, 2018), which generalizes the stacked recurrent
neural networks (RNNs) on sequences to DAG structures. Each layer of DG-DAGRNN
consists of gated recurrent units (GRUs) (Cho et al, 2014a), which are
repeated for each node $v_i \in \mathcal{G}_{in}$. The GRU corresponding to node $v$ receives an
aggregated representation of the hidden states of the units corresponding to its
predecessors $\pi(v)$. For an aggregation function $A$:

$$ h_v = \mathrm{GRU}(x_v, h'_v), \quad \text{where } h'_v = A(\{h_u \mid u \in \pi(v)\}). \tag{12.20} $$


Since the ordering of the nodes is defined by the topological sort of $\mathcal{G}_{in}$, all the
hidden states $h_v$ can be computed with a single forward pass along a layer of
DG-DAGRNN. The encoder contains multiple layers, each of which passes hidden states
to the recurrent units in the subsequent layer corresponding to the same node.
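
The single forward pass in topological order can be sketched as follows. The sketch assumes scalar node features and a GRU with hidden size 1, uses SUM as the aggregation function $A$, and relies on a hypothetical parameter dictionary `p`; it illustrates one layer only and is not the DG-DAGRNN implementation.

```python
import math
from graphlib import TopologicalSorter


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def gru_cell(x, h, p):
    """Scalar GRU update (hidden size 1) with a hypothetical parameter dict p."""
    z = sigmoid(p["wz"] * x + p["uz"] * h)             # update gate
    r = sigmoid(p["wr"] * x + p["ur"] * h)             # reset gate
    cand = math.tanh(p["wh"] * x + p["uh"] * (r * h))  # candidate state
    return (1.0 - z) * h + z * cand


def encode_layer(preds, x, p):
    """One layer pass over a DAG.

    preds maps each node to the set of its predecessors, and x maps each
    node to its scalar input feature. Nodes are visited in topological
    order, so every predecessor state exists before it is aggregated.
    """
    h = {}
    for v in TopologicalSorter(preds).static_order():
        h_agg = sum(h[u] for u in preds.get(v, ()))  # aggregation A = SUM
        h[v] = gru_cell(x[v], h_agg, p)
    return h
```

A source node has no predecessors, so its aggregated state is zero and its hidden state depends only on its own feature.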

The encoder outputs an embedding $H_{in} = E_\alpha(\mathcal{G}_{in})$, which serves as the input of
the DAG decoder. The decoder follows the local-based node-sequential generation
style. First, the number of nodes of the target graph is predicted by a
multilayer perceptron (MLP) with a Poisson regressor output layer, which takes the
input graph embedding $H_{in}$ and outputs the mean of a Poisson distribution
describing the output graph. For each newly generated node $v_n$, an MLP module
determines whether to add an edge $e_{u,v_n}$ for each node $u \in \{v_1, \dots, v_{n-1}\}$
already in the graph. Since the output nodes are generated in their topological order,
the edges are directed from the nodes added earlier to the nodes added later. For each
node $v$, the hidden state $h_v$ is calculated using a mechanism similar to that used in
the encoder: the hidden states of its predecessors are aggregated and fed to a GRU.
The other input for the GRU consists of the aggregated states of all the sink nodes
generated so far. For the first node, the hidden state is initialized based on the
encoder’s output. Then, the output node features are generated based on its hidden
state using another MLP module. Finally, once the last node has been generated, edges
from the remaining sinks in the graph to this node are introduced with probability 1
to ensure a connected graph with only one sink node as an output.


12.4.2 Editing-based Node-Edge Co-Transformation 


Unlike the encoder-decoder framework, editing-based (modification-based) NECT
directly modifies the input graph iteratively to generate the target graphs (Guo et al,
2019c; You et al, 2018a; Zhou et al, 2019c). Two methods are generally used to edit the
source graph. One employs a reinforcement-learning agent to sequentially modify
the source graph based on a formulated Markov decision process (You et al, 2018a;
Zhou et al, 2019c). The modification at each step is selected from a defined action
set that includes “add node”, “add edge”, “remove bonds” and so on. The other
updates the nodes and edges of the source graph synchronously in a one-shot
manner through the MPNN using several iterations (Guo et al, 2019c).


12.4.2.1 Graph Convolutional Policy Networks 


Motivated by the large size of chemical space, which can be an issue when designing
molecular structures, graph convolutional policy networks (GCPNs) serve as
general graph convolutional network-based models for goal-directed graph
generation through reinforcement learning (RL) (You et al, 2018a). In this model,
the generation process can be guided towards specific desired objectives while
restricting the output space based on underlying chemical rules. To achieve
goal-directed generation, three strategies are adopted: graph representation,
reinforcement learning, and adversarial training. In GCPN, molecules are represented
as molecular graphs, and partially generated molecular graphs can be interpreted as
substructures. GCPN is designed as an RL agent that operates within a
chemistry-aware graph generation environment. A molecule is successively constructed
by either connecting a new substructure or atom to an existing molecular graph or
adding a bond between existing atoms. GCPN is trained with a policy gradient to
optimize a reward composed of molecular property objectives and an adversarial loss,
and it acts in an environment that incorporates domain-specific rules. The
adversarial loss is provided by a GCN-based discriminator trained jointly on a
dataset of example molecules.

An iterative graph generation process is designed and formulated as a general
decision process $M = (\mathcal{S}, \mathcal{A}, P, R, \gamma)$, where $\mathcal{S} = \{s_i\}$ is the set of states that
comprises all possible intermediate and final graphs, $\mathcal{A} = \{a_i\}$ is the set of actions
that describe the modifications made to the current graph during each iteration,
$P$ represents the transition dynamics that specify the possible outcomes of carrying
out an action, $p(s_{t+1} \mid s_t, \dots, s_0, a_t)$, $R(s_t) = r_t$ is a reward function that
specifies the reward after reaching state $s_t$, and $\gamma$ is the discount factor. The graph
generation process can now be formulated as a trajectory
$(s_0, a_0, r_0, \dots, s_n, a_n, r_n)$, and the modification of the graph at each step can be
described as a state transition distribution:
$p(s_{t+1} \mid s_t, \dots, s_0) = \sum_{a_t} p(a_t \mid s_t, \dots, s_0)\, p(s_{t+1} \mid s_t, \dots, s_0, a_t)$,
where $p(a_t \mid s_t, \dots, s_0)$ is represented as a policy network $\pi_\theta$. Note that in this
process, the state transition dynamics are designed to satisfy the Markov property
$p(s_{t+1} \mid s_t, \dots, s_0) = p(s_{t+1} \mid s_t)$.

In this model, a distinct, fixed-dimension, homogeneous action space is defined
that is amenable to reinforcement learning, where an action is analogous to link
prediction. Specifically, a set of scaffold subgraphs $\{C_1, \dots, C_S\}$ is first defined
based on the source graph, thus serving as a subgraph vocabulary that contains the
subgraphs to be added into the target graph during graph generation. Define
$C = \bigcup_{i=1}^{S} C_i$. Given the modified graph $\mathcal{G}_t$ at step $t$, the corresponding
extended graph can be defined as $\mathcal{G}_t \cup C$. Under this definition, an action can either
correspond to connecting a new subgraph $C_i$ to a node in $\mathcal{G}_t$ or connecting existing
nodes within graph $\mathcal{G}_t$. A GAN is also employed to define the adversarial rewards to
ensure that generated molecules do indeed resemble the originals.

Node embedding is achieved by message passing over each edge type for $L$ layers
through GCN. At the $l$-th layer of GCN, messages from different edge types are
aggregated to calculate the node embedding $H^{(l+1)} \in \mathbb{R}^{(n+c) \times k}$ of the next layer,
where $n$ and $c$ are the sizes of $\mathcal{G}_t$ and $C$, respectively, and $k$ is the embedding
dimension:

$$ H^{(l+1)} = \mathrm{AGG}\Big(\mathrm{ReLU}\big(\{\tilde{D}_i^{-\frac{1}{2}} \tilde{E}_i \tilde{D}_i^{-\frac{1}{2}} H^{(l)} W_i^{(l)}\},\ \forall i \in (1, \dots, b)\big)\Big). \tag{12.21} $$

$E_i$ is the $i$-th slice of the edge-conditioned adjacency tensor $E$, $\tilde{E}_i = E_i + I$,
$\tilde{D}_i = \sum_k \tilde{E}_{ijk}$, and $W_i^{(l)}$ is the weight matrix for the $i$-th edge type. AGG
denotes one of the aggregation functions from {MEAN, MAX, SUM, CONCAT}.
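
A single edge-type-aware layer of this kind can be sketched in a few lines. The sketch assumes small dense matrices stored as nested lists, fixes the aggregation to SUM, and uses hypothetical helper names (`matmul`, `normalize`, `gcn_layer`); it is meant only to make the per-edge-type normalization and aggregation concrete.

```python
import math


def matmul(A, B):
    """Dense matrix product for nested-list matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]


def normalize(E_i):
    """Symmetric normalization D^{-1/2} (E_i + I) D^{-1/2} of one slice."""
    n = len(E_i)
    E_t = [[E_i[r][c] + (1.0 if r == c else 0.0) for c in range(n)]
           for r in range(n)]
    d = [sum(row) for row in E_t]  # degrees including the self-loop
    return [[E_t[r][c] / math.sqrt(d[r] * d[c]) for c in range(n)]
            for r in range(n)]


def gcn_layer(E, H, W):
    """One layer: E is a list of b adjacency slices, W a list of b weight
    matrices, H the current node embeddings. AGG is fixed to SUM here."""
    n, k_out = len(H), len(W[0][0])
    out = [[0.0] * k_out for _ in range(n)]
    for E_i, W_i in zip(E, W):
        M = matmul(matmul(normalize(E_i), H), W_i)
        for r in range(n):
            for c in range(k_out):
                out[r][c] += max(M[r][c], 0.0)  # ReLU, then SUM over edge types
    return out
```

Swapping SUM for MEAN, MAX, or concatenation only changes how the per-edge-type results are combined in the inner loop.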

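Each component of the link prediction-based action $a_t$ is sampled from a prediction
distribution governed by the equations below:

$$ a_t = \mathrm{CONCAT}(a_{first}, a_{second}, a_{edge}, a_{stop}) \tag{12.22} $$

$$ \begin{aligned}
f_{first}(s_t) &= \mathrm{softmax}(m_f(X)), & a_{first} &\sim f_{first}(s_t) \in \{0,1\}^{n} \\
f_{second}(s_t) &= \mathrm{softmax}(m_s(X_{a_{first}}, X)), & a_{second} &\sim f_{second}(s_t) \in \{0,1\}^{n+c} \\
f_{edge}(s_t) &= \mathrm{softmax}(m_e(X_{a_{first}}, X_{a_{second}})), & a_{edge} &\sim f_{edge}(s_t) \in \{0,1\}^{b} \\
f_{stop}(s_t) &= \mathrm{softmax}(m_t(\mathrm{AGG}(X))), & a_{stop} &\sim f_{stop}(s_t) \in \{0,1\}
\end{aligned} \tag{12.23} $$

Here $X$ denotes the final node embeddings, and $m_f$, $m_s$, $m_e$ and $m_t$ denote MLP modules.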
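
The sequential dependence between the four action components can be sketched as below. The MLP modules are replaced by plain score lists or hypothetical score functions (`second_scores_fn`, `edge_scores_fn`), and the sampled components are returned as integer indices rather than one-hot vectors; both choices are simplifications made for illustration.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sample_index(scores, rng):
    """Sample an index from the softmax distribution over scores."""
    probs = softmax(scores)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def sample_action(first_scores, second_scores_fn, edge_scores_fn,
                  stop_scores, rng):
    """Sample (first node, second node, edge type, stop flag) in sequence.

    The second node is conditioned on the first, and the edge type on both,
    mirroring the order in which the components are predicted.
    """
    a_first = sample_index(first_scores, rng)                # one of n nodes
    a_second = sample_index(second_scores_fn(a_first), rng)  # one of n + c nodes
    a_edge = sample_index(edge_scores_fn(a_first, a_second), rng)  # one of b types
    a_stop = sample_index(stop_scores, rng)                  # 0 = continue, 1 = stop
    return (a_first, a_second, a_edge, a_stop)
```

A seeded `random.Random` instance can be passed as `rng` to make the sampling reproducible.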


12.4.2.2 Molecule Deep Q-networks Transformer 


In addition to GCPN, molecule deep Q-networks (MolDQN) has also been developed
for molecule optimization under the node-edge co-transformation problem using
an editing-based style. It combines domain knowledge of chemistry with
state-of-the-art reinforcement learning techniques (double Q-learning and
randomized value functions) (Zhou et al, 2019c). In this field, traditional methods usually
employ policy gradients to generate graph representations of molecules, but these
suffer from high variance when estimating the gradient (Gu et al, 2016). In
comparison, MolDQN is based on value function learning, which is usually more stable
and sample efficient. MolDQN also avoids the need for expert pretraining on some
datasets, which may lead to lower variance but limits the search space considerably.

In the framework proposed here, modifications of molecules are directly defined
to ensure 100% chemical validity. Modification or optimization is performed in a
step-wise fashion, where each step belongs to one of the following three categories:
(1) atom addition, (2) bond addition, and (3) bond removal. Because the molecule
generated depends solely on the molecule being changed and the modification made,
the optimization process can be formulated as a Markov decision process (MDP).
Specifically, when performing the action atom addition, an empty set of atoms $\mathcal{V}_T$
for the target molecule graph is first defined. Then, a valid action is defined as adding
an atom in $\mathcal{V}_T$ and also a bond between the added atom and the original molecule
wherever possible. When performing the action bond addition, a bond is added
between two atoms in $\mathcal{V}_T$. If there is no existing bond between the two atoms, the
actions between them can consist of adding a single, double or triple bond. If there
is already a bond, this action changes the bond type by increasing the index of the
bond type by one or two. When performing the action bond removal, the valid bond
removal action set is defined as the actions that decrease the bond type index of an
existing bond. Possible transitions include: (1) Triple bond → {Double, Single, No}
bond, (2) Double bond → {Single, No} bond, and (3) Single bond → {No} bond.
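
The bond addition and bond removal rules can be written down as a small enumeration over bond-type indices. The sketch below encodes bond types as integers 0 to 3 (no bond, single, double, triple) and deliberately ignores valence constraints, which an actual chemistry-aware environment would also have to check; the function names are illustrative.

```python
MAX_ORDER = 3  # 0 = no bond, 1 = single, 2 = double, 3 = triple


def bond_addition_targets(order):
    """Bond-type indices reachable from `order` by one bond addition action."""
    if order == 0:
        return [1, 2, 3]  # add a single, double, or triple bond
    # An existing bond may be increased by one or two, up to a triple bond.
    return [o for o in (order + 1, order + 2) if o <= MAX_ORDER]


def bond_removal_targets(order):
    """Bond-type indices reachable from `order` by one bond removal action."""
    return list(range(order - 1, -1, -1))  # every strictly lower index


# Example: a triple bond can become a double bond, a single bond, or no bond.
print(bond_removal_targets(3))
```
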
Based on the molecule modification MDP defined above, RL aims to find a policy
$\pi$ that chooses an action for each state that maximizes future rewards. The
decision is made by finding the action $a$ for a state $s$ that maximizes the Q function:

$$ Q^{\pi}(s, a) = Q^{\pi}(m, t, a) = \mathbb{E}_{\pi}\Big[\sum_{n=t}^{T} r_n\Big], \tag{12.24} $$

where the state $s = (m, t)$ pairs the molecule $m$ with the step index $t$, and $r_n$ is the
reward at step $n$. The optimal policy can therefore be defined as
$\pi^*(s) = \arg\max_a Q^{\pi^*}(s, a)$. A neural network is adopted to approximate
$Q(s, a; \theta)$, and can be trained by minimizing the loss function:

$$ l(\theta) = \mathbb{E}\big[f_l(y_t - Q(s_t, a_t; \theta))\big], \tag{12.25} $$

where $y_t = r_t + \max_a Q(s_{t+1}, a; \theta)$ is the target value and $f_l$ is the Huber loss:

$$ f_l(x) = \begin{cases} \frac{1}{2}x^2 & \text{if } |x| < 1 \\ |x| - \frac{1}{2} & \text{otherwise} \end{cases} \tag{12.26} $$
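
The target value and the Huber loss for a single transition can be computed as in the following sketch. It assumes the standard Huber form with threshold 1 (quadratic branch $\frac{1}{2}x^2$, linear branch $|x| - \frac{1}{2}$), takes the Q-values of the next state as a plain list, and treats an empty list as a terminal state with no bootstrapped value; the last point is an assumption made here, not something stated in the text.

```python
def huber(x):
    """Huber loss with threshold 1: quadratic near zero, linear in the tails."""
    return 0.5 * x * x if abs(x) < 1.0 else abs(x) - 0.5


def td_target(reward, next_q_values):
    """Target y_t = r_t + max_a Q(s_{t+1}, a).

    An empty list is treated as a terminal state (assumption), so the
    target reduces to the immediate reward.
    """
    return reward + (max(next_q_values) if next_q_values else 0.0)


def td_loss(q_value, reward, next_q_values):
    """Huber loss of the temporal-difference error for one transition."""
    return huber(td_target(reward, next_q_values) - q_value)
```

The two branches of `huber` agree at $|x| = 1$, where both evaluate to 0.5, so the loss is continuous at the threshold.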


In a real-world setting, it is usually desirable for several different properties to be
optimized at the same time. Under the multi-objective RL setting, the environment
will return a vector of rewards at each step $t$ with one reward for each objective. A
scalarized reward framework is applied to achieve multi-objective optimization, with
the introduction of a user-defined weight vector $w = [w_1, w_2, \dots, w_k]^\top \in \mathbb{R}^k$. The
reward is calculated as:

$$ r_{s,t} = w^\top \vec{r}_t = \sum_{i=1}^{k} w_i r_{i,t}. \tag{12.27} $$

The objective of the MDP is to maximize the cumulative scalarized reward.
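
As a small worked example of the scalarization, the following sketch combines a reward vector with a weight vector by a dot product. The objective names in the comments and the numerical values are illustrative only.

```python
def scalarized_reward(weights, rewards):
    """Weighted sum of per-objective rewards at one step."""
    if len(weights) != len(rewards):
        raise ValueError("weights and rewards must have the same length")
    return sum(w * r for w, r in zip(weights, rewards))


# Two objectives, e.g. a property score and a similarity score (illustrative).
step_reward = scalarized_reward([0.7, 0.3], [0.5, 1.0])  # 0.7*0.5 + 0.3*1.0
```

Changing the weight vector shifts the trade-off between objectives without changing the underlying MDP.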

The Q-learning model (Mnih et al, 2015) is implemented here, incorporating the
improvements gained using double Q-learning (Van Hasselt et al, 2016), with a deep
neural network being used to approximate the Q-function. The input molecule is
converted to a vector in the form of a Morgan fingerprint (Rogers and Hahn,
2010) with a radius of 3 and a length of 2048. The number of steps remaining in
the episode is concatenated to the vector, and a four-layer fully-connected network
with hidden sizes of [1024, 512, 128, 32] and ReLU activation is used as the
architecture.


12.4.2.3  Node-Edge Co-evolving Deep Graph Translator 


The Node-Edge Co-evolving Deep Graph Translator (NEC-DGT) has been developed
to achieve so-called multi-attributed graph translation and has been proven to be a
generalization of the existing topology translation models (Guo et al, 2019c). It
targets a number of challenges including, but not limited to, the mutually
dependent translation of the node and edge attributes, asynchronous and iterative
changes in the node and edge attributes during graph translation, and the difficulty of
discovering and enforcing the correct consistency between node attributes and graph
spectra. NEC-DGT edits the source graph iteratively through a generation process
similar to the MPNN-based adjacency-based one-shot method for unconditional deep
graph generation, with the main difference being that it takes the graph in the source
domain as input rather than an initialized graph (Guo et al, 2019c).

NEC-DGT employs a multi-block translation architecture to learn the distribution
of the graphs in the target domain, conditioning on the input graphs and
contextual information. Specifically, the inputs are the node and edge attributes of
the input graph, and the model outputs are the generated graphs’ node and edge
attributes after several blocks. A skip-connection architecture is implemented across
the different blocks to handle the asynchronous properties of different blocks,
ensuring the final translated results fully utilize various combinations of blocks’
information. The following loss function is minimized:

$$ \mathcal{L}_{\mathcal{T}} = \mathcal{L}\big(\mathcal{T}(\mathcal{G}(E_0, F_0), C),\ \mathcal{G}(E', F')\big), \tag{12.28} $$

where $C$ corresponds to the contextual information vector, $E_0$ and $E'$ correspond to
the edge attribute tensors of the input and target graphs, respectively, and $F_0$ and $F'$
correspond to the node attribute tensors of the input and target graphs, respectively.

To jointly handle the various interactions among the nodes and edges, the
respective translation paths are considered for each block. For example, in the node
translation path, edges-to-nodes and nodes-to-nodes interactions are considered in
the generation of node attributes. Similarly, nodes-to-edges and edges-to-edges
interactions are considered in the generation of edge attributes.

The frequency domain properties of the graph are learned, by which the
interactions between node and edge attributes are jointly regularized utilizing a
non-parametric graph Laplacian. Also, shared patterns among the generated nodes and
edges in different blocks are enforced through regularization. The regularization
term is

$$ R(\mathcal{G}(E, F)) = \sum_{s=0}^{S} R_\theta(\mathcal{G}(E_s, F_s)) + R_\theta(\mathcal{G}(E', F')), \tag{12.29} $$

where $S$ corresponds to the number of blocks and $\theta$ refers to the overall parameters
in the spectral graph regularization. $\mathcal{G}(E_S, F_S)$ is the generated target graph, where
$E_S$ is the generated edge attributes tensor and $F_S$ is the node attributes matrix. The
total loss function is then

$$ \tilde{\mathcal{L}} = \mathcal{L}\big(\mathcal{T}(\mathcal{G}(E_0, F_0), C),\ \mathcal{G}(E', F')\big) + \beta R(\mathcal{G}(E, F)). \tag{12.30} $$

The model is trained by minimizing the MSE of $E_S$ with $E'$ and of $F_S$ with $F'$, subject
to the regularization. $\mathcal{T}(\cdot)$ is the mapping from the input graph to the target graph
learned from the multi-attributed graph translation.

The transformation process is modeled by several stages, with each stage
generating an intermediate graph. Specifically, each stage $t$ has two paths: a
node translation path and an edge translation path. In the node translation path, an
MLP-based influence function is used to calculate the influence $I_i^{(t)}$ on each node
$v_i$ from its neighboring nodes. Another MLP-based updating function is used to
update the node attribute as $F_i^{(t)}$ with the influence $I_i^{(t)}$ as input. The edge
translation path is constructed in the same way as the node translation path, with each
edge being generated by the influence from its adjacent edges.


12.5 Other Graph-based Transformations 


12.5.1 Sequence-to-Graph Transformation 


A deep sequence-to-graph transformation aims to generate a target graph $\mathcal{G}_T$
conditioned on an input sequence $X$. This problem is often seen in domains such as NLP
(Chen et al, 2018a; Wang et al, 2018g) and time series mining (Liu et al, 2015; Yang
et al, 2020c).

Existing methods (Chen et al, 2018a; Wang et al, 2018g) handle the semantic
parsing task by transforming a sequence-to-graph problem into a
sequence-to-sequence problem and utilizing the classical RNN-based encoder-decoder model
to learn this mapping. A neural semantic parsing approach, named
Sequence-to-Action, models semantic parsing as an end-to-end semantic graph generation
process (Chen et al, 2018a). Given a sentence $X = \{x_1, \dots, x_m\}$, the Sequence-to-Action
model generates a sequence of actions $Y = \{y_1, \dots, y_m\}$ for constructing the
correct semantic graph. A semantic graph consists of nodes (including variables,
entities, and types) and edges (semantic relationships), with universal operations (e.g.,
argmax, argmin, count, sum, and not). To generate a semantic graph, six types of
actions are defined: Add Variable Node, Add Entity Node, Add Type Node, Add Edge,
Operation Function and Argument Action. In this way, the generated parse tree is
represented as a sequence, and the sequence-to-graph problem is transformed into
a sequence-to-sequence problem. The attention-based sequence-to-sequence RNN
model with an encoder and decoder can be utilized, where the encoder converts the
input sequence $X$ to a sequence of context-sensitive vectors $\{b_1, \dots, b_m\}$ using a
bidirectional RNN and a classical attention-based decoder generates the action sequence $Y$
based on the context-sensitive vectors (Bahdanau et al, 2015). Wang et al (2018g)
likewise represent the generation of a parse tree as a sequence of actions and borrow
concepts from the Stack-LSTM neural parsing model (Dyer et al, 2015), producing two
non-trivial improvements, Bi-LSTM subtraction and incremental tree-LSTM, that
improve the process of learning a sequence-to-sequence mapping.
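
How an action sequence determines a semantic graph can be made concrete with a toy interpreter. The tuple encoding of actions below is hypothetical and covers only the three node-adding actions and Add Edge; Operation Function and Argument Action are omitted for brevity.

```python
def build_semantic_graph(actions):
    """Replay a sequence of graph-construction actions.

    Each action is a tuple: ("add_variable" | "add_entity" | "add_type", name)
    or ("add_edge", relation, source, target). This encoding is illustrative.
    Returns (nodes, edges), where nodes maps a name to its node kind.
    """
    nodes, edges = {}, []
    for act in actions:
        kind = act[0]
        if kind in ("add_variable", "add_entity", "add_type"):
            nodes[act[1]] = kind[len("add_"):]
        elif kind == "add_edge":
            _, rel, src, dst = act
            if src not in nodes or dst not in nodes:
                raise ValueError("edge endpoints must be added before the edge")
            edges.append((src, rel, dst))
        else:
            raise ValueError(f"unknown action: {kind}")
    return nodes, edges


# "Which states border Texas?" as a hand-written action sequence (illustrative).
actions = [
    ("add_variable", "A"),
    ("add_type", "state"),
    ("add_entity", "texas"),
    ("add_edge", "type", "A", "state"),
    ("add_edge", "next_to", "A", "texas"),
]
nodes, edges = build_semantic_graph(actions)
```

Because the graph is fully determined by the action sequence, predicting the sequence with a sequence-to-sequence model is sufficient to predict the graph.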

Other methods have also been developed to handle the problem of Time Series
Conditional Graph Generation (Liu et al, 2015; Yang et al, 2020c): given an input
multivariate time series, the aim is to infer a target relation graph that models the
underlying interrelationships among the time series, with each node corresponding to
one time series. A model named time series conditioned graph generation-generative
adversarial networks (TSGG-GAN) has been proposed that explores the
use of GANs in a conditional setting (Yang et al, 2020c). Specifically, the generator
in a TSGG-GAN adopts a variant of recurrent neural networks known as simple
recurrent units (SRU) (Lei et al, 2017b) to extract essential information from the time
series, and uses an MLP to generate the directed weighted graph.


12.5.2 Graph-to-Sequence Transformation 


A number of graph-to-sequence encoder-decoder models have been proposed to 
handle rich and complex data structures, which are hard for sequence-to-sequence 
methods to handle (Gao et al, 2019c; Bastings et al, 2017; Beck et al, 2018; Song 
et al, 2018; Xu et al, 2018c). A graph-to-sequence model typically employs a graph- 
neural-network-based (GNN-based) encoder and an RNN/Transformer-based de- 
coder, with most being designed to tackle tasks such as natural language genera- 
tion (NLG), which is an important task in NLP (YILMAZ et al, 2020). Graph-to- 
sequence models have the ability to capture the rich structural information of the 
input and can also be applied to arbitrary graph-structured data. 

Early graph-to-sequence methods and their follow-up works (Bastings et al, 
2017; Damonte and Cohen, 2019; Guo et al, 2019e; Marcheggiani et al, 2018; Xu 
et al, 2020b,d; Zhang et al, 2020d,c) have mainly used a graph convolutional net- 
work (GCN) (Kipf and Welling, 2017b) as the graph encoder, probably because 
GCN was the first widely used GNN model that sparked this new wave of research 
on GNNs and their applications. Early GNN variants, such as GCN, were not orig- 
inally designed to encode information on the edge type and so cannot be directly 
applied to the encoding of multi-relational graphs in NLP. Later on, more graph 
transformer models (Cai and Lam, 2020; Jin and Gildea, 2020; Koncel-Kedziorski 
et al, 2019) were introduced to the graph-to-sequence architecture to handle these 
multi-relational graphs. These graph transformer models generally function by either
replacing the self-attention network in the original transformer with a masked
self-attention network, or explicitly incorporating edge embeddings into the
self-attention network.

Because edge direction in an NLP graph often encodes critical information regarding
semantic meanings, capturing bidirectional information in the text is helpful
and has been widely explored in works such as BiLSTM and BERT (Devlin et al,
2019). Some attention has also been devoted to extending the existing GNN models
to handle directed graphs. For example, separate model parameters can be intro- 
duced for different edge directions (e.g., incoming/outgoing/self-loop edges) when 
conducting neighborhood aggregation (Guo et al, 2019e; Marcheggiani et al, 2018; 
Song et al, 2018). A BiLSTM-like strategy has also been proposed to learn the node 
embeddings of each direction independently using two separate GNN encoders and 
then concatenating the two embeddings for each node to obtain the final node em- 
beddings (Xu et al, 2018b,c,d). 

In the field of NLP, graphs are usually multi-relational, where the edge type in- 
formation is vital for the prediction. Similar to the bidirectional graph encoder in- 
troduced above, separate model parameters for different edge types are considered 
when encoding edge type information with GNNs (Chen et al, 2018e; Ghosal et al, 
2020; Schlichtkrull et al, 2018). However, usually the total number of edge types 
is large, leading to non-negligible scalability issues for the above strategies. This 
problem can be tackled by converting a multi-relational graph to a Levi graph (Levi, 
1942), which is bipartite. To create a Levi graph, all the edges in the original graph 
are treated as new nodes and new edges are added to connect the original nodes and 
new nodes. 
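
The Levi graph construction can be sketched directly from this description. The input is assumed to be a list of (head, relation, tail) triples, and the tagging of nodes as `("node", name)` or `("edge", index, relation)` is a representation chosen here for illustration.

```python
def to_levi_graph(triples):
    """Convert labeled edges into a Levi graph.

    triples: iterable of (head, relation, tail).
    Returns (nodes, edges). Every original edge becomes a new node tagged
    "edge", linked from its head and to its tail by unlabeled edges.
    """
    nodes, edges, seen = [], [], set()
    for idx, (head, rel, tail) in enumerate(triples):
        for name in (head, tail):
            if name not in seen:
                seen.add(name)
                nodes.append(("node", name))
        edge_node = ("edge", idx, rel)  # index keeps repeated relations distinct
        nodes.append(edge_node)
        edges.append((("node", head), edge_node))
        edges.append((edge_node, ("node", tail)))
    return nodes, edges


levi_nodes, levi_edges = to_levi_graph([
    ("boy", "ARG0", "want"),
    ("want", "ARG1", "go"),
])
```

In the result every edge joins an original node and an edge node, which is the bipartite property mentioned above, and the edge type now lives in a node feature rather than in separate per-type parameters.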

Apart from NLP, graph-to-sequence transformation has been employed in other
fields, for example when modeling complex transitions of an individual user’s
activities among different healthcare subforums over time and learning how this is
related to their various health conditions (Gao et al, 2019c). By formulating the
transition of user activities as a dynamic graph with multi-attributed nodes, the health
stage inference is formalized as a dynamic graph-to-sequence learning problem and,
hence, a dynamic graph-to-sequence neural network architecture (DynGraph2Seq)
has been proposed (Gao et al, 2019c). This model contains a dynamic graph
encoder and an interpretable sequence decoder. In the same work, a dynamic graph
hierarchical attention mechanism capable of capturing both time-level and
node-level attention is also proposed, providing model transparency throughout the
whole inference process.


12.5.3 Context-to-Graph Transformation 


Deep graph generation conditioning on semantic context aims to generate the target
graph $\mathcal{G}_T$ conditioning on an input semantic context that is usually represented in
the form of additional meta-features. The semantic context can refer to the category,
label, modality, or any additional information that can be intuitively represented as
a vector $C$. The main issue here is to decide where to concatenate or embed the
condition representation into the generation process. In summary, the conditioning
information can be added to one or more of the following modules: (1)
the node state initialization module, (2) the message passing process for MPNN-based
decoding, and (3) the conditional distribution parameterization for sequential
generation.

A novel unified model of graph variational generative adversarial nets has been
proposed, where the conditioning semantic context is input into the node state
initialization module (Yang et al, 2019a). Specifically, the generation process begins
by modeling the embedding $Z_i$ of each node with separate latent distributions,
after which a conditional graph VAE (CGVAE) can be directly constructed by
concatenating the condition vector $C$ to each node’s latent representation $Z_i$ to obtain
the updated node latent representation, written here as $\hat{Z}_i$. The distribution of the
individual edge $e_{i,j}$ is assumed to be a Bernoulli distribution, which is parameterized
by the value $\hat{A}_{i,j}$ and calculated as $\hat{A}_{i,j} = \mathrm{Sigmoid}(f(\hat{Z}_i)^\top f(\hat{Z}_j))$, where $f(\cdot)$ is
constructed using a few fully connected layers. A conditional deep graph generative
model that adds the semantic context information into the initialized latent
representations $Z_i$ at the beginning of the decoding process has also been proposed
(Li et al, 2018d).
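
The effect of concatenating the condition vector before computing edge probabilities can be sketched as follows. The transformation `f` stands in for the fully connected layers and is passed in as an arbitrary callable; latent vectors and the condition vector are plain lists, so `+` denotes concatenation. These are illustrative choices rather than details of the cited model.

```python
import math


def edge_probability(z_i, z_j, cond, f):
    """Bernoulli parameter of edge (i, j) from condition-augmented latents.

    z_i, z_j: latent vectors of the two nodes (lists of floats).
    cond: condition vector C, appended to each latent vector.
    f: stand-in for the fully connected layers applied to each node.
    """
    u = f(z_i + cond)  # list concatenation appends C to the latent vector
    v = f(z_j + cond)
    score = sum(a * b for a, b in zip(u, v))  # inner product of the two outputs
    return 1.0 / (1.0 + math.exp(-score))    # sigmoid


identity = lambda vec: vec  # trivial stand-in for f
p_unconditioned = edge_probability([1.0, 0.0], [0.0, 1.0], [0.0], identity)
p_conditioned = edge_probability([1.0, 0.0], [0.0, 1.0], [2.0], identity)
```

With orthogonal latents and a zero condition the probability is 0.5, while a nonzero condition shifts the inner product and therefore the edge probability.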

Other researchers have added the context information $C$ into the message passing
module as part of its MPNN-based decoding process (Li et al, 2018f). Specifically,
the decoding process is parameterized as a Markov process and the graph is
generated by iteratively refining and updating the initialized graph. At each step $t$, an
action is conducted based on the current node hidden states $H^t = \{h_1^t, \dots, h_N^t\}$. To
calculate $h_i^t \in \mathbb{R}^l$ ($l$ denotes the length of the representation) for node $v_i$ in the
intermediate graph $\mathcal{G}_t$ after each update of the graph, a message passing network
is utilized with node message propagation. The context information $C \in \mathbb{R}^k$ is
added to the operation of the MPNN layer as follows:

$$ h_i^{t} = W h_i^{t-1} + \Phi \sum_{v_j \in N(v_i)} h_j^{t-1} + \Theta C, \tag{12.31} $$

where $W \in \mathbb{R}^{l \times l}$, $\Phi \in \mathbb{R}^{l \times l}$ and $\Theta \in \mathbb{R}^{l \times k}$ are all learnable weight matrices and $k$
denotes the length of the semantic context vector.
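
One context-conditioned update step can be sketched as below, reading Eq. (12.31) as a self term, a term over summed neighbor states, and a context term. The sketch assumes that the context weight matrix has shape $l \times k$ so that its product with $C$ lies in $\mathbb{R}^l$, stores matrices as nested lists, and uses the illustrative names `W`, `Phi` and `Theta`.

```python
def matvec(M, v):
    """Matrix-vector product for a nested-list matrix."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def mpnn_step(h_prev, neighbors, C, W, Phi, Theta):
    """One update: h_i <- W h_i + Phi * sum_j h_j + Theta C.

    h_prev: dict node -> hidden vector of length l.
    neighbors: dict node -> list of neighboring nodes.
    C: context vector of length k; Theta is assumed to be l x k.
    """
    l = len(W)
    h_new = {}
    for i, h_i in h_prev.items():
        agg = [sum(h_prev[j][d] for j in neighbors[i]) for d in range(l)]
        terms = (matvec(W, h_i), matvec(Phi, agg), matvec(Theta, C))
        h_new[i] = [sum(t) for t in zip(*terms)]
    return h_new


I2 = [[1.0, 0.0], [0.0, 1.0]]
h1 = mpnn_step(
    h_prev={0: [1.0, 0.0], 1: [0.0, 1.0]},
    neighbors={0: [1], 1: [0]},
    C=[2.0],
    W=I2, Phi=I2, Theta=[[1.0], [0.0]],
)
```

Since the same context term is added to every node at every step, the condition influences the whole refinement trajectory rather than only the initialization.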

Semantic context has also been considered as one of the inputs for calculating the
conditional distribution parameter at each step during the sequential generation
process (Jonas, 2019). The aim here is to solve the molecule inverse problem by
inferring the chemical structure conditioning on the formula and spectra of a molecule,
which provide a distinguishable fingerprint of its bond structure. The problem is
framed as an MDP and molecules are constructed incrementally one bond at a time
by a deep neural network, which learns to imitate a “subisomorphic oracle” that
knows whether the generated bonds are correct. The context information
(e.g., spectra) is applied in two places. The process begins with an empty edge set
$E_0$ that is sequentially updated to $E_k$ at each step $k$ by adding an edge sampled
from $p(e_{i,j} \mid E_{k-1}, V, C)$, where $V$ denotes the node set that is defined by the given
molecular formula. The edge set keeps updating until the existing edges satisfy all the
valence constraints of a molecule. The resulting edge set $E_K$ then serves as the
candidate graph. For a given spectrum $C$, the process is repeated $T$ times, generating
$T$ (potentially different) candidate structures $\{E_K^{(i)}\}_{i=1}^{T}$. Then, based on a spectral
prediction function $f(\cdot)$, the quality of these candidate structures is evaluated by
measuring how close their predicted spectra are to the condition spectrum $C$. Finally,
the optimal generated graph is selected according to $\arg\min_i \|f(E_K^{(i)}) - C\|_2$.
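
The final selection step amounts to a nearest-spectrum search over the candidates, which can be sketched as follows. The spectral predictor is passed in as a hypothetical callable `predict_spectrum`, and candidates are represented as frozen sets of edges; the toy predictor in the example is a placeholder with no chemical meaning.

```python
import math


def select_candidate(candidates, predict_spectrum, target):
    """Return the candidate whose predicted spectrum is closest to `target`
    in Euclidean (L2) distance."""
    def distance(edge_set):
        pred = predict_spectrum(edge_set)
        return math.sqrt(sum((p - c) ** 2 for p, c in zip(pred, target)))
    return min(candidates, key=distance)


# Toy example: the "spectrum" is just the number of edges (placeholder only).
candidates = [frozenset({(0, 1)}), frozenset({(0, 1), (1, 2)})]
toy_predictor = lambda edge_set: [float(len(edge_set))]
best = select_candidate(candidates, toy_predictor, target=[2.0])
```

Repeating the stochastic generation several times and keeping the best-scoring candidate trades extra computation for a better match to the observed spectrum.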


12.6 Summary 


In this chapter, we introduce the definitions and techniques for the transformation
problem that involves graphs in the domain of deep graph neural networks.
We provide a formal definition of the general deep graph transformation problem
as well as its four sub-problems, namely node-level transformation, edge-level
transformation, node-edge co-transformation, and other graph-involved
transformations (e.g., sequence-to-graph transformation and context-to-graph
transformation). For each sub-problem, its unique challenges and several representative
methods are introduced. As an emerging research domain, there are still many
open problems to be solved for future exploration, including but not limited to:
(1) Improved scalability. Existing deep graph transformation models typically have
super-linear time complexity in the number of nodes and cannot scale well to large
networks. Consequently, most existing works merely focus on small graphs,
typically with dozens to thousands of nodes. It is difficult for them to handle many
real-world networks with millions to billions of nodes, such as the internet of things,
biological neuronal networks, and social networks. (2) Applications in NLP. As
more and more GNN-based works have advanced the development of NLP, graph
transformation is naturally a good fit for addressing some NLP tasks, such as
information extraction and semantic parsing. For example, information extraction can
be formalized into a graph-to-graph problem where the input graph is the
dependency graph and the output graph is the information graph. (3) Explainable graph
transformation. When we learn the underlying distribution of the generated target
graphs, learning interpretable representations of graphs that expose semantic
meaning is very important. For example, it would be highly beneficial if we could
identify which latent variable(s) control(s) which specific properties (e.g., molecule
mass) of the target graphs (e.g., molecules). Thus, investigations on the explainable
graph transformation process are critical yet unexplored.


Editor’s Notes: Graph transformation is deemed very relevant to graph gen- 
eration (see Chapter 11) and can be considered as an extension of the latter. 
In many real-world applications, one is usually required to generate graphs 
with some condition or control from the users. For example, one may want 
to generate molecules under some targeted properties (see Chapters 24 and 
25) or programs under some function (see Chapter 22). In addition, graph- 
to-graph transformation also has a connection to link prediction (Chapter 
10) and node classification (Chapter 4), though the former could be more 
challenging since it typically requires simultaneous node-edge prediction,
and possibly also comes with the consideration of stochasticity.


Abstract The problem of graph matching that tries to establish some kind of
structural correspondence between a pair of graph-structured objects is one of the key
challenges in a variety of real-world applications. In general, the graph matching 
problem can be classified into two categories: i) the classic graph matching problem 
which finds an optimal node-to-node correspondence between nodes of a pair of in- 
put graphs and ii) the graph similarity problem which computes a similarity metric 
between two graphs. While recent years have witnessed the great success of GNNs 
in learning node representations of graphs, there is an increasing interest in explor- 
ing GNNs for the graph matching problem in an end-to-end manner. This chapter 
focuses on the state of the art of graph matching models based on GNNs. We start 
by introducing some backgrounds of the graph matching problem. Then, for each 
category of graph matching problem, we provide a formal definition and discuss 
state-of-the-art GNN-based models for both the classic graph matching problem 
and the graph similarity problem, respectively. Finally, this chapter is concluded by 
pointing out some possible future research directions. 
13.1 Introduction


As graphs are natural and ubiquitous representations for describing sophisticated
data structures, the problem of graph matching, which tries to establish some kind
of structural correspondence between two input graph-structured objects, is one of
the key challenges in a variety of research fields, such as
computer vision (Vento and Foggia, 2013), bioinformatics (Elmsallati et al, 2016),
cheminformatics (Koch et al, 2019; Bai et al, 2019b), computer security (Hu et al,
2009; Wang et al, 2019i), source/binary code analysis (Li et al, 2019h; Ling et al,
2021), and social network analysis (Kazemi et al, 2015), to name just a few. In
particular, recent research advances in graph matching have been closely involved
in many real-world applications in the field of computer vision, including visual
tracking (Cai et al, 2014; Wang and Ling, 2017), action recognition (Guo et al,
2018a), pose estimation (Cao et al, 2017, 2019), etc. In addition to the study in
computer vision, graph matching also serves as an important foundation of many
other graph-based research tasks, e.g., node and graph classification tasks (Richiardi
et al, 2013; Bai et al, 2019c; Ok, 2020), graph generation tasks (You et al, 2018b;
Ok, 2020), etc.

In a broad sense, according to different goals of graph matching in a wide vari- 
ety of real-world applications, the general graph matching problem can be classified 
into two categories (Yan et al, 2016) as follows. The first category is the classic 
graph matching problem (Loiola et al, 2007; Yan et al, 2020a) that tries to estab- 
lish the node-to-node correspondence (and/or even edge-to-edge correspondence) 
between the pair of input graphs. The second category is the graph similarity prob- 
lem (Bunke, 1997; Riesen, 2015; Ma et al, 2019a) with the purpose of computing 
a similarity score between two input graphs. Both categories have the same inputs 
(i.e., a pair of input graphs) but with different outputs, whereby the output of the first 
category is mainly formulated as a correspondence matrix while the output of the 
second category is usually expressed as a similarity scalar. From the perspective of 
outputs, the second graph similarity problem can be viewed as a special case of the 
first graph matching problem, as the similarity scalar reflects a more coarse-grained 
correspondence representation of graph matching than the correspondence matrix. 

Generally, both categories of the graph matching problem are known to be NP- 
hard (Loiola et al, 2007; Yan et al, 2020a; Bunke, 1997; Riesen, 2015; Ma et al, 
2019a), making both problems computationally infeasible for exact and optimum 
solutions in large-scale and real-world settings. Given the great importance and in- 
herent difficulty of the graph matching problem, it has been heavily investigated in 
theory and practice and a huge number of approximate algorithms based on theo- 
retical/empirical knowledge of experts have been proposed to find sub-optimal so- 
lutions in an acceptable time. Interested readers are referred to (Loiola et al, 2007; 
Yan et al, 2016; Foggia et al, 2014; Riesen, 2015) for a more extensive review, as 
these approximation methods are beyond the scope of this chapter. Unfortunately,
although various approximation methods have been devoted to resolving the graph
matching problem over the past decades, the problem still suffers from poor scalability
as well as heavy reliance on expert knowledge, and thus remains
a challenging and significant research problem for many practitioners.

More recently, GNNs that attempt to adapt deep learning from images to
non-Euclidean data (i.e., graphs) have received unprecedented attention for learning
informative representations (e.g., of nodes or (sub)graphs) of graph-structured data in
an end-to-end manner (Kipf and Welling, 2017b; Wu et al, 2021d; Rong et al,
2020c). Since then, a surge of GNN models have been presented for learning effective
node embeddings for downstream tasks, such as node classification tasks (Hamilton
et al, 2017a; Veličković et al, 2018; Chen et al, 2020m), graph classification
tasks (Ying et al, 2018c; Ma et al, 2019d; Gao and Ji, 2019), graph generation
tasks (Simonovsky and Komodakis, 2018; Samanta et al, 2019; You et al, 2018b), and
so on. The great success of GNN-based models on these application tasks demonstrates
that GNNs are a powerful class of deep learning models for learning graph
representations for downstream tasks.

Encouraged by the great success of GNN-based models on many
other graph-related tasks, many researchers have started to adopt GNNs for the
graph matching problem and a large number of GNN-based models have been proposed
to improve the matching accuracy and efficiency (Zanfir and Sminchisescu,
2018; Rolínek et al, 2020; Wang et al, 2019g; Jiang et al, 2019a; Fey et al, 2020; Yu
et al, 2020; Wang et al, 2020j; Bai et al, 2018, 2020b, 2019b; Xiu et al, 2020; Ling
et al, 2020; Zhang, 2020; Wang et al, 2020f; Li et al, 2019h; Wang et al, 2019i).
During the training stage, these models try to learn a mapping between the pair
of input graphs and the ground-truth correspondence in a supervised manner and
thus are more time-efficient during the inference stage than traditional approximation
methods. In this chapter, we walk through the recent advances and developments
of graph matching models based on GNNs. Particularly, we focus on how
to incorporate GNNs into the framework of graph matching/similarity learning and 
try to provide a systematic introduction and review of state-of-the-art GNN-based 
methods for both categories of the graph matching problem (i.e., the classic graph 
matching problem in Section 13.2 and the graph similarity problem in Section 13.3, 
respectively). 


13.2 Graph Matching Learning 


In this section, we start by introducing the first category of the graph matching
problem, i.e., the classic graph matching problem¹, and provide a formal definition
of it. Subsequently, we focus our discussion on state-of-the-art graph matching
models based on deep learning as well as more advanced GNNs in the literature.


¹ For simplicity, we refer to the classic graph matching problem as the graph matching problem
in the following sections of this chapter.
13.2.1 Problem Definition


A graph of size $n$ (i.e., number of nodes) can be represented as
$\mathcal{G} = (\mathcal{V}, \mathcal{E}, A, X, E)$, in which
$\mathcal{V} = \{v_1, \cdots, v_n\}$ denotes the set of nodes (also known as vertices),
$\mathcal{E} \subseteq \mathcal{V} \times \mathcal{V}$ denotes the set of edges,
$A \in \{0,1\}^{n \times n}$ denotes the adjacency matrix, $X \in \mathbb{R}^{n \times d}$
denotes the initial feature matrix of nodes, and $E \in \mathbb{R}^{n \times n \times d'}$
denotes an optional initial feature matrix of edges.

The purpose of the graph matching problem is to find an optimal node-to-node
correspondence between two input graphs, i.e., $\mathcal{G}^{(1)}$ and $\mathcal{G}^{(2)}$.
Without loss of generality, we consider the graph matching problem in which the two
input graphs are of equal size². In particular, we provide a formal definition of the
graph matching problem in Definition 13.1 as follows and give an example illustration
of the node-to-node correspondence in Fig. 13.1.


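Definition 13.1 (Graph Matching Problem). Given a pair of input graphs
$\mathcal{G}^{(1)} = (\mathcal{V}^{(1)}, \mathcal{E}^{(1)}, A^{(1)}, X^{(1)}, E^{(1)})$ and
$\mathcal{G}^{(2)} = (\mathcal{V}^{(2)}, \mathcal{E}^{(2)}, A^{(2)}, X^{(2)}, E^{(2)})$ of equal size $n$,
the graph matching problem is to find a node-to-node correspondence matrix
$S \in \{0,1\}^{n \times n}$ (also called the assignment matrix or permutation matrix) between the
two graphs $\mathcal{G}^{(1)}$ and $\mathcal{G}^{(2)}$. Each element $S_{i,a} = 1$ if and only if the
node $v_i \in \mathcal{V}^{(1)}$ in $\mathcal{G}^{(1)}$ corresponds to the node
$v_a \in \mathcal{V}^{(2)}$ in $\mathcal{G}^{(2)}$.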


Intuitively, the resulting correspondence matrix S represents the possibility of es- 
tablishing a matching relation between any pair of nodes in two graphs. The graph 
matching problem is known to be NP-hard and has been investigated by formulating 
it as a quadratic assignment problem (QAP) (Loiola et al, 2007; Yan et al, 2016). We 
adopt the general form of Lawler’s QAP (Lawler, 1963) with constraints as follows 
since it has been widely adopted in literature. 


Fig. 13.1 An example illustration of the graph matching problem with two input
graphs, i.e., the left graph $\mathcal{G}^{(1)}$ and the right graph $\mathcal{G}^{(2)}$ to be
matched. The red dotted lines represent the node-to-node correspondences between the
two graphs.


² For simplicity, we assume that a pair of input graphs in the graph matching problem have the same
number of nodes, but the problem can be extended to a pair of graphs with different numbers of nodes
by adding dummy nodes, which is commonly adopted in the graph matching literature (Krishnapuram
et al, 2004).
$$
s^* = \arg\max_{s} \; s^\top K s
\quad \text{s.t.} \quad S 1_n = 1_n, \;\; S^\top 1_n = 1_n,
\tag{13.1}
$$

where $s = \mathrm{vec}(S) \in \{0,1\}^{n^2}$ is the column-wise vectorized version of the assignment
matrix $S$ and $1_n$ is a column vector of length $n$ whose elements are equal to 1.
Particularly, $K \in \mathbb{R}^{n^2 \times n^2}$ is the corresponding second-order affinity matrix in
which each element $K_{ij,ab}$ measures how well every pair of nodes
$(v_i, v_j) \in \mathcal{V}^{(1)} \times \mathcal{V}^{(1)}$ matches
$(v_a, v_b) \in \mathcal{V}^{(2)} \times \mathcal{V}^{(2)}$ and can be defined as follows (Zhou and De la
Torre, 2012).

$$
K_{\mathrm{ind}(i,j),\,\mathrm{ind}(a,b)} =
\begin{cases}
c_{ia} & \text{if } i = j \text{ and } a = b, \\
d_{ijab} & \text{else if } A^{(1)}_{ij} A^{(2)}_{ab} > 0, \\
0 & \text{otherwise,}
\end{cases}
\tag{13.2}
$$

where $\mathrm{ind}(\cdot,\cdot)$ is a bijection function that maps a pair of nodes to an integer index,
the diagonal element (i.e., $c_{ia}$) encodes the node-to-node (i.e., first-order) affinity
between the node $v_i \in \mathcal{V}^{(1)}$ and the node $v_a \in \mathcal{V}^{(2)}$, and the
off-diagonal element (i.e., $d_{ijab}$) encodes the edge-to-edge (i.e., second-order) affinity
between the edge $(v_i, v_j) \in \mathcal{E}^{(1)}$ and the edge $(v_a, v_b) \in \mathcal{E}^{(2)}$.

Another important aspect of the formulation in Equation (13.1) is the constraint,
i.e., $S 1_n = 1_n$ and $S^\top 1_n = 1_n$. It demands that the matching output of the graph
matching problem, i.e., the correspondence matrix $S \in \{0,1\}^{n \times n}$, be strictly
constrained to be a doubly-stochastic matrix. Formally, the correspondence matrix $S$
is a doubly-stochastic matrix if the summation of each column and each row of it is
1. That is, $\forall i, \sum_{j} S_{i,j} = 1$ and $\forall j, \sum_{i} S_{i,j} = 1$. Therefore, the
resulting correspondence matrix of the graph matching problem should satisfy the
requirement of the doubly-stochastic matrix.
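To make the objective of Equation (13.1) concrete, the following is a minimal NumPy sketch that assembles an affinity matrix in the spirit of Equation (13.2) and solves a three-node instance by enumerating all permutation matrices. The function names, the toy graphs, and the index layout (row i + n·a of K stands for the candidate match between v_i and v_a, chosen to agree with the column-wise vec(S)) are assumptions made for illustration, not part of the chapter.

```python
import itertools
import numpy as np


def build_affinity(A1, A2, C, D):
    """Assemble K in the spirit of Eq. (13.2) for two graphs with n nodes each.

    Assumed layout: index i + n * a stands for the candidate match (v_i, v_a),
    which agrees with the column-wise vectorization s = vec(S).
    C[i, a] holds c_ia and D[i, j, a, b] holds d_ijab.
    """
    n = A1.shape[0]
    K = np.zeros((n * n, n * n))
    for i, j, a, b in itertools.product(range(n), repeat=4):
        row, col = i + n * a, j + n * b
        if i == j and a == b:
            K[row, col] = C[i, a]            # first-order (node) affinity
        elif A1[i, j] * A2[a, b] > 0:
            K[row, col] = D[i, j, a, b]      # second-order (edge) affinity
    return K


def qap_score(K, S):
    """Objective s^T K s of Eq. (13.1)."""
    s = S.flatten(order="F")                 # column-wise vec(S)
    return float(s @ K @ s)


def brute_force_match(K, n):
    """Enumerate all n! permutation matrices; only viable for tiny graphs."""
    best_S, best_score = None, -np.inf
    for perm in itertools.permutations(range(n)):
        S = np.zeros((n, n))
        S[np.arange(n), list(perm)] = 1.0    # S[i, perm[i]] = 1
        score = qap_score(K, S)
        if score > best_score:
            best_S, best_score = S, score
    return best_S, best_score


# Toy data: G1 is the path 0-1-2 and G2 is the same path with relabeled nodes.
A1 = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]])
P = np.zeros((3, 3), dtype=int)
P[np.arange(3), [2, 0, 1]] = 1               # hidden relabeling used to build G2
A2 = P.T @ A1 @ P
C = np.zeros((3, 3))                         # no node affinity in this toy case
D = np.ones((3, 3, 3, 3))                    # unit affinity for every edge pair
K = build_affinity(A1, A2, C, D)
best_S, best_score = brute_force_match(K, 3)
```

With unit edge affinities, the score simply counts the (directed) edges of the first graph that land on edges of the second, so any maximizer is an isomorphism. The factorial enumeration is exactly what makes the exact problem infeasible beyond toy sizes.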

In general, the main challenge in optimizing and solving Equation (13.1) lies in
how to model the affinity as well as how to optimize under the constraint.
Traditional methods mostly utilize pre-defined affinity models with limited
capacity (e.g., a Gaussian kernel with Euclidean distance (Cho et al, 2010)) and resort
to different heuristic optimizations (e.g., graduated assignment (Gold and Rangarajan,
1996), spectral method (Leordeanu and Hebert, 2005), random walk (Cho et al,
2010), etc.). However, such traditional methods suffer from poor scalability and
inferior performance in large-scale settings as well as across a broad range of
application scenarios (Yan et al, 2020a). Recently, studies on graph matching have
started to explore the high capacity of deep learning models, which achieve state-of-the-art
performance. In the following subsections, we will first give a brief introduction of
deep learning based graph matching models and then discuss state-of-the-art graph 
matching models based on GNNs. 
13.2.2 Deep Learning based Models


Aiming at increasing the matching performance, extensive research interest in leveraging
the high capacity of deep learning models to solve the graph matching problem
has been ignited since Zanfir and Sminchisescu (2018), which introduced an end-to-end
deep learning framework for the graph matching problem for the first time and
received the best paper honorable mention award at CVPR 2018³.


Deep Graph Matching. In (Zanfir and Sminchisescu, 2018), Zanfir and Sminchisescu
first relax the graph matching problem of Equation (13.1) with the $\ell_2$ constraint
as follows.

$$
s^* = \arg\max_{s} \; s^\top K s
\quad \text{s.t.} \quad \|s\|_2 = 1
\tag{13.3}
$$


To solve the problem, they introduce deep learning techniques to
graph matching and propose an end-to-end training framework with standard
differentiable backpropagation and optimization algorithms. The proposed deep
graph matching framework first uses an existing pre-trained CNN model (i.e.,
VGG-16 (Simonyan and Zisserman, 2014b)) to extract node features (i.e.,
$U^{(1)}, U^{(2)} \in \mathbb{R}^{n \times d}$) and edge features (i.e.,
$F^{(1)} \in \mathbb{R}^{p \times 2d}$ and $F^{(2)} \in \mathbb{R}^{q \times 2d}$) from the pair
of input images in the scenario of computer vision applications. In particular, $F^{(1)}$
and $F^{(2)}$ are row-wise edge feature matrices with $p$ and $q$ as the number of edges in
each graph, respectively. As each edge attribute is the concatenation of the features of its
start and end nodes, the dimension of an edge attribute (i.e., $2d$) is double the
dimension of a node feature.

Next, based on the extracted node/edge features, it builds the graph matching affinity
matrix $K$ via the factorization method of graph matching (Zhou and De la Torre,
2012) as follows.

$$
\begin{aligned}
K &= [\mathrm{vec}(K_p)] + (G_2 \otimes G_1)\,[\mathrm{vec}(K_e)]\,(H_2 \otimes H_1)^\top \\
  &= [\mathrm{vec}(U^{(1)} U^{(2)\top})] + (G_2 \otimes G_1)\,[\mathrm{vec}(F^{(1)} \Lambda F^{(2)\top})]\,(H_2 \otimes H_1)^\top
\end{aligned}
\tag{13.4}
$$

where $[x]$ denotes a diagonal matrix whose diagonal elements are the entries of $x$; $\otimes$ denotes
the Kronecker product; $G_i$ and $H_i$ ($i = \{1,2\}$) are the node-edge incidence matrices
that are recovered from the adjacency matrices $A^{(i)}$, i.e., $A^{(i)} = G_i H_i^\top$ ($i = \{1,2\}$);
$K_p \in \mathbb{R}^{n \times n}$ encodes the node-to-node similarity and is directly obtained from the
product of two node feature matrices, i.e., $K_p = U^{(1)} U^{(2)\top}$; $K_e \in \mathbb{R}^{p \times q}$
encodes the edge-to-edge similarity and is calculated by $K_e = F^{(1)} \Lambda F^{(2)\top}$. It is worth
noting that $\Lambda \in \mathbb{R}^{2d \times 2d}$ is a learnable parameter matrix and thus the built
graph matching affinity matrix $K$ in Equation (13.4) is a learnable affinity model.
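As a rough illustration of the factorized affinity in Equation (13.4), the NumPy sketch below builds incidence matrices from adjacency matrices and assembles K with Kronecker products. Random matrices stand in for the VGG-16 features and for the learnable matrix Λ, and the helper names (`incidence`, `vec`, `factorized_affinity`) are hypothetical. The final lines compare the quadratic form s^T K s against a direct sum over node pairs and edge pairs.

```python
import numpy as np


def incidence(A):
    """Return G, H and the edge list so that A = G @ H.T.

    G[i, c] = 1 if edge c starts at node i; H[j, c] = 1 if edge c ends at node j.
    """
    n = A.shape[0]
    edges = [(i, j) for i in range(n) for j in range(n) if A[i, j] > 0]
    G = np.zeros((n, len(edges)))
    H = np.zeros((n, len(edges)))
    for c, (i, j) in enumerate(edges):
        G[i, c] = 1.0
        H[j, c] = 1.0
    return G, H, edges


def vec(M):
    """Column-wise vectorization."""
    return M.flatten(order="F")


def factorized_affinity(A1, A2, U1, U2, Lam):
    G1, H1, E1 = incidence(A1)
    G2, H2, E2 = incidence(A2)
    # Each edge feature concatenates the features of its start and end node.
    F1 = np.array([np.concatenate([U1[i], U1[j]]) for i, j in E1])  # p x 2d
    F2 = np.array([np.concatenate([U2[a], U2[b]]) for a, b in E2])  # q x 2d
    Kp = U1 @ U2.T                      # node-to-node similarity, n x n
    Ke = F1 @ Lam @ F2.T                # edge-to-edge similarity, p x q
    K = np.diag(vec(Kp)) + np.kron(G2, G1) @ np.diag(vec(Ke)) @ np.kron(H2, H1).T
    return K, Kp, Ke, E1, E2


rng = np.random.default_rng(0)
n, d = 3, 2
A1 = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]])
A2 = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]])
U1, U2 = rng.normal(size=(n, d)), rng.normal(size=(n, d))   # stand-in node features
Lam = rng.normal(size=(2 * d, 2 * d))                       # stand-in for the learnable matrix
K, Kp, Ke, E1, E2 = factorized_affinity(A1, A2, U1, U2, Lam)

S = np.eye(n)[[1, 2, 0]]                # a permutation matrix, S[i, a] = 1 iff v_i <-> v_a
s = vec(S)
quadratic = float(s @ K @ s)
direct = float(np.sum(Kp * S) + sum(
    Ke[c1, c2] * S[i, a] * S[j, b]
    for c1, (i, j) in enumerate(E1)
    for c2, (a, b) in enumerate(E2)
))
```

The agreement of `quadratic` and `direct` shows why the factorization is convenient: the large n² × n² matrix never has to be defined entry by entry, only through the much smaller K_p, K_e and the incidence matrices.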

Then, with the spectral matching technique (Leordeanu and Hebert, 2005), the
graph matching problem is translated into computing the leading eigenvector $s^*$,
which can be approximated by the power iteration algorithm as follows.

$$
s_{k+1} = \frac{K s_k}{\|K s_k\|_2}
\tag{13.5}
$$

in which $s$ is initialized with $s_0 = 1$ and $K$ is computed from Equation (13.4). It
is also worth noting that the spectral graph matching solver in Equation (13.5) is
differentiable but un-learnable. Because the resulting $s_{k+1}$ (reshaped into a matrix) is not
a doubly-stochastic matrix, the framework employs a bi-stochastic normalization layer that
repeatedly normalizes the matrix by columns and rows.

³ https://www.thecvf.com/?page_id=413
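The power iteration of Equation (13.5) can be sketched in a few lines of NumPy. The snippet below is only an illustration under stated assumptions: a random symmetric matrix with positive entries stands in for the learned affinity matrix K, and the iteration count is an arbitrary choice. The result is compared with the leading eigenvector returned by `numpy.linalg.eigh`.

```python
import numpy as np


def power_iteration(K, num_iters=100):
    """Approximate the leading eigenvector of K, starting from the all-ones vector."""
    s = np.ones(K.shape[0])
    for _ in range(num_iters):
        s = K @ s
        s = s / np.linalg.norm(s)       # l2 normalization, as in Eq. (13.5)
    return s


rng = np.random.default_rng(0)
n = 3
R = rng.uniform(size=(n * n, n * n))
K = R + R.T                             # stand-in affinity: symmetric, positive entries
s_star = power_iteration(K)

# Reshape column-wise to obtain a soft (not yet doubly-stochastic) matching matrix.
S_soft = s_star.reshape((n, n), order="F")

# Reference: eigh returns eigenvalues in ascending order, so the last column is the leader.
eigvals, eigvecs = np.linalg.eigh(K)
v_top = eigvecs[:, -1]
alignment = abs(float(s_star @ v_top))  # close to 1 when both vectors point the same way
```

Every step is a matrix-vector product followed by a normalization, which is why gradients can flow through the solver even though it has no parameters of its own.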

Finally, the whole graph matching model is trained in an end-to-end fashion with
a displacement loss $\mathcal{L}_{disp}$, which measures the difference between the predicted
displacement and the ground-truth displacement.

$$
\mathcal{L}_{disp} = \sum_{i=1}^{n} \sqrt{\|d_i - d_i^{gt}\|^2 + \epsilon}
\quad \text{and} \quad
d_i = \sum_{v_a \in \mathcal{V}^{(2)}} \left( S_{i,a} P_a^{(2)} \right) - P_i^{(1)}
\tag{13.6}
$$

where $P^{(1)}$ and $P^{(2)}$ are the coordinates of nodes in both images; the vector $d_i$
measures the pixel offset; $d_i^{gt}$ is the corresponding ground truth; and $\epsilon$ is a small value
for robust penalty.
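A small NumPy sketch of the displacement loss in Equation (13.6) follows. The coordinates and the ground-truth matching are made-up toy values, and the function name is hypothetical.

```python
import numpy as np


def displacement_loss(S, P1, P2, d_gt, eps=1e-8):
    """Robust displacement loss; rows of P1 / P2 are 2-D node coordinates."""
    d = S @ P2 - P1                                   # d_i = sum_a S[i, a] * P2[a] - P1[i]
    return float(np.sum(np.sqrt(np.sum((d - d_gt) ** 2, axis=1) + eps)))


# Toy keypoint coordinates in two images (hypothetical values).
P1 = np.array([[10.0, 20.0], [30.0, 40.0], [50.0, 60.0]])
P2 = np.array([[52.0, 61.0], [11.0, 22.0], [29.0, 41.0]])
S_gt = np.array([[0.0, 1.0, 0.0],
                 [0.0, 0.0, 1.0],
                 [1.0, 0.0, 0.0]])                    # ground-truth correspondence
d_gt = S_gt @ P2 - P1                                 # ground-truth displacement

loss_perfect = displacement_loss(S_gt, P1, P2, d_gt)
loss_uniform = displacement_loss(np.full((3, 3), 1.0 / 3.0), P1, P2, d_gt)
```

A perfect prediction leaves only the `eps` term under each square root, whereas an uninformative uniform matching is penalized by roughly the pixel distance between the averaged and the true target positions.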


Deep Graph Matching via Black-box Combinatorial Solver. Motivated by advances
in incorporating a combinatorial optimization solver into a neural network
(Pogancic et al, 2020), Rolínek et al (2020) propose an end-to-end neural
network which seamlessly embeds a black-box combinatorial solver, namely BB-GM,
for the graph matching problem. To be specific, given two cost vectors (i.e.,
$c^v \in \mathbb{R}^{|\mathcal{V}^{(1)}||\mathcal{V}^{(2)}|}$ and
$c^e \in \mathbb{R}^{|\mathcal{E}^{(1)}||\mathcal{E}^{(2)}|}$) for both node-to-node and
edge-to-edge correspondences, the graph matching problem is formulated as follows.

$$
\mathrm{GM}(c^v, c^e) = \arg\min_{(s^v, s^e) \in \mathrm{Adm}(\mathcal{G}^{(1)}, \mathcal{G}^{(2)})}
\left\{ c^v \cdot s^v + c^e \cdot s^e \right\}
\tag{13.7}
$$

where GM denotes the black-box combinatorial solver;
$s^v \in \{0,1\}^{|\mathcal{V}^{(1)}||\mathcal{V}^{(2)}|}$ is the indicator vector of matched nodes;
$s^e \in \{0,1\}^{|\mathcal{E}^{(1)}||\mathcal{E}^{(2)}|}$ is the indicator vector of matched
edges; and $\mathrm{Adm}(\mathcal{G}^{(1)}, \mathcal{G}^{(2)})$ represents the set of all possible
matching results between $\mathcal{G}^{(1)}$ and $\mathcal{G}^{(2)}$.

By the formulation, the core of the graph matching problem is to construct the
two cost vectors $c^v$ and $c^e$. Therefore, BB-GM first employs a pre-trained VGG-16
model to extract node embeddings and learns edge embeddings via SplineCNN (Fey
et al, 2018). Then, based on the learned node embeddings, $c^v$ is computed by a
weighted inner product similarity between the pairs of node embeddings of the
two graphs, along with a learnable neural network based on the graph-level feature
vector. Similarly, $c^e$ is computed by a weighted inner product similarity between
the pairs of edge embeddings of the two graphs, along with the same neural
network.
13.2.3 Graph Neural Network based Models


More recently, GNNs have started to be studied to deal with the graph matching 
problem. This is because GNNs bring about new opportunities on the tasks over 
graph-like data and further improve the model capability taking structural informa- 
tion of graphs into account. Besides, GNNs can be easily incorporated with other 
deep learning architectures (e.g., CNN, RNN, MLP, etc.) and thus provide an end- 
to-end learning framework for the graph matching problem. 


Cross-graph Affinity based Graph Matching. Wang et al (2019g) claim that theirs is
the first work that employs GNNs for deep graph matching learning (at least in computer
vision). By exploiting the highly efficient learning capabilities of GNNs, which
can update the node embeddings with the structural affinity information between
two graphs, the graph matching problem, i.e., the quadratic assignment problem, is
translated into a linear assignment problem that can be easily solved.

In particular, the authors present the cross-graph affinity based graph matching
model with the permutation loss, namely PCA-GM. PCA-GM consists of three
steps. First, to enhance the node embeddings that are learned for each individual graph
with a standard message-passing network (i.e., an intra-graph convolution network), PCA-GM
further updates the node embeddings with an extra cross-graph convolution network,
i.e., CrossGConv, which not only aggregates the information from local neighbors,
but also incorporates the information from similar nodes in the other graph.
Fig. 13.2 illustrates an intuitive comparison between the intra-graph convolution
network and the cross-graph convolution network, which is formulated as follows.


$$
\begin{aligned}
H^{(1)(k)} &= \mathrm{CrossGConv}\left(\hat{S}, H^{(1)(k-1)}, H^{(2)(k-1)}\right) \\
H^{(2)(k)} &= \mathrm{CrossGConv}\left(\hat{S}^\top, H^{(2)(k-1)}, H^{(1)(k-1)}\right)
\end{aligned}
\tag{13.8}
$$

where $H^{(1)(k)}$ and $H^{(2)(k)}$ are the $k$-layer node embeddings for the graphs
$\mathcal{G}^{(1)}$ and $\mathcal{G}^{(2)}$; $k$ denotes the $k$-th iteration; $\hat{S}$ denotes the
predicted assignment matrix which is computed from shallower node embedding layers; and the
initial embeddings, i.e., $H^{(1)(0)}$ and $H^{(2)(0)}$, are extracted via a pre-trained VGG-16
network in line with Zanfir and Sminchisescu (2018).


Fig. 13.2 For one node in the left graph $\mathcal{G}^{(1)}$, the intra-graph convolution network
only operates on its own graph, i.e., the purple solid lines in $\mathcal{G}^{(1)}$. However, the
cross-graph convolution network operates on both its own graph (i.e., the purple solid
lines in $\mathcal{G}^{(1)}$) as well as the other graph (i.e., blue dashed lines from all nodes in
$\mathcal{G}^{(2)}$ to the node in $\mathcal{G}^{(1)}$).

Second, based on the resulting node embeddings $H^{(1)}$ and $H^{(2)}$ for both graphs,
PCA-GM computes the node-to-node assignment matrix $\hat{S}$ by a bi-linear mapping
followed by an exponential function as follows.

$$
\hat{S} = \exp\left( \frac{H^{(1)} \Theta H^{(2)\top}}{\tau} \right)
\tag{13.9}
$$

where $\Theta$ denotes the learnable parameter matrix for the assignment matrix learning
and $\tau > 0$ is a hyper-parameter. As the obtained $\hat{S} \in \mathbb{R}^{n \times n}$ does not
satisfy the constraint of the doubly-stochastic matrix, PCA-GM uses the Sinkhorn (Adams and
Zemel, 2011) operation for the relaxed linear assignment problem because it is fully
differentiable and has been proven effective for the final graph matching prediction.

$$
S = \mathrm{Sinkhorn}(\hat{S})
\tag{13.10}
$$
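Equations (13.9) and (13.10) can be illustrated with the following NumPy sketch. It is written under simplifying assumptions: the embeddings are one-hot toy vectors rather than GNN outputs, Θ is fixed to the identity instead of being learned, and Sinkhorn is run for a fixed number of alternating row and column normalizations.

```python
import numpy as np


def soft_assignment(H1, H2, Theta, tau=0.1):
    """Bi-linear affinity followed by an exponential, as in Eq. (13.9)."""
    return np.exp(H1 @ Theta @ H2.T / tau)


def sinkhorn(M, num_iters=50):
    """Alternate row and column normalization of a positive matrix."""
    S = M.astype(float).copy()
    for _ in range(num_iters):
        S = S / S.sum(axis=1, keepdims=True)   # each row sums to 1
        S = S / S.sum(axis=0, keepdims=True)   # each column sums to 1
    return S


perm = np.array([2, 0, 1])      # toy ground truth: node i of G1 matches node perm[i] of G2
H1 = np.eye(3)                  # toy embeddings of G1 (one-hot for readability)
H2 = np.zeros((3, 3))
H2[perm] = H1                   # node perm[i] of G2 copies the embedding of node i of G1
Theta = np.eye(3)               # stand-in for the learnable parameter matrix

S_hat = soft_assignment(H1, H2, Theta)
S = sinkhorn(S_hat)
predicted = S.argmax(axis=1)
```

A smaller `tau` sharpens the exponential, so the normalized matrix moves closer to a hard permutation; a larger `tau` yields a smoother matrix whose gradients are less peaked.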

Finally, PCA-GM adopts the combinatorial permutation loss that computes the
cross entropy loss between the final predicted permutation $S$ and the ground-truth
permutation $S^{gt}$ for supervised graph matching learning.

$$
\mathcal{L}_{perm} = - \sum_{v_i \in \mathcal{V}^{(1)},\, v_a \in \mathcal{V}^{(2)}}
\left( S^{gt}_{i,a} \log(S_{i,a}) + (1 - S^{gt}_{i,a}) \log(1 - S_{i,a}) \right)
\tag{13.11}
$$

Experimental results in (Wang et al, 2019g) demonstrate that graph matching models
with the permutation loss outperform those with the displacement loss in
Equation (13.6).
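The permutation loss of Equation (13.11) is an element-wise binary cross entropy summed over all node pairs. A minimal NumPy sketch, with made-up predictions and a clipping constant added purely for numerical safety, might look as follows.

```python
import numpy as np


def permutation_loss(S, S_gt, eps=1e-12):
    """Cross entropy between a predicted soft matching S and the ground truth S_gt."""
    S = np.clip(S, eps, 1.0 - eps)             # avoid log(0); eps is an implementation detail
    return float(-np.sum(S_gt * np.log(S) + (1.0 - S_gt) * np.log(1.0 - S)))


S_gt = np.eye(3)                               # toy ground-truth permutation
S_good = np.full((3, 3), 0.05)
np.fill_diagonal(S_good, 0.9)                  # confident and correct prediction
S_uniform = np.full((3, 3), 1.0 / 3.0)         # uninformative prediction

loss_good = permutation_loss(S_good, S_gt)
loss_uniform = permutation_loss(S_uniform, S_gt)
loss_exact = permutation_loss(S_gt, S_gt)
```

Unlike the displacement loss, this objective does not depend on pixel coordinates at all: it only asks whether each entry of the soft matching agrees with the ground-truth permutation.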


Graph Learning-Matching Network. Most prior studies on the graph matching
problem rely on established graphs with fixed structure information, i.e., the edge set
with or without attributes. Differently, Jiang et al (2019a) present a graph learning-matching
network, namely GLMNet, which incorporates graph structure learning
(i.e., learning the graph structure information) into the general graph matching
learning to build a unified model architecture. To be specific, based on
the pair of node feature matrices $X^{(l)} = \{x_1^{(l)}, \cdots, x_n^{(l)}\}$ ($l = \{1,2\}$), GLMNet
attempts to learn a pair of optimal graph adjacency matrices $A^{(l)}$ ($l = \{1,2\}$) to better
serve the subsequent graph matching learning, and each element is computed as
follows.

$$
A^{(l)}_{i,j} = \frac{\exp\left(\sigma\left(\theta^\top [x_i^{(l)}, x_j^{(l)}]\right)\right)}
{\sum_{j'=1}^{n} \exp\left(\sigma\left(\theta^\top [x_i^{(l)}, x_{j'}^{(l)}]\right)\right)},
\quad l = \{1,2\}
\tag{13.12}
$$

where $\sigma$ is the activation function, e.g., ReLU; $[\cdot,\cdot]$ denotes the concatenation
operation; and $\theta$ denotes the trainable parameter for the graph structure learning, which
is shared for both input graphs.
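Equation (13.12) is a row-wise softmax over pairwise scores. The NumPy sketch below illustrates it with random node features and a random parameter vector standing in for the trained θ; the function name and the max-subtraction trick for numerical stability are additions made for this example.

```python
import numpy as np


def learn_adjacency(X, theta):
    """Row-normalized soft adjacency from pairwise node features, as in Eq. (13.12)."""
    n = X.shape[0]
    # pair[i, j] = [x_i, x_j], the concatenation of the two node feature vectors.
    pair = np.concatenate(
        [np.repeat(X[:, None, :], n, axis=1), np.repeat(X[None, :, :], n, axis=0)],
        axis=-1,
    )
    scores = np.maximum(pair @ theta, 0.0)                 # ReLU(theta^T [x_i, x_j])
    scores = scores - scores.max(axis=1, keepdims=True)    # stability; softmax is unchanged
    weights = np.exp(scores)
    return weights / weights.sum(axis=1, keepdims=True)


rng = np.random.default_rng(0)
n, d = 4, 3
X = rng.normal(size=(n, d))          # stand-in node features
theta = rng.normal(size=2 * d)       # stand-in for the shared trainable parameter
A_learned = learn_adjacency(X, theta)
```

Because the normalization runs over each row, every node distributes a unit of attention over all nodes, and the resulting dense matrix plays the role of the adjacency matrix in the subsequent graph convolution layers.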
Following PCA-GM (Wang et al, 2019g), GLMNet also explores a series of
graph convolution modules to learn informative node embeddings of both input
graphs for the subsequent affinity matrix learning and matching prediction. Based on the
obtained $A^{(l)}$ and $X^{(l)}$ ($l = \{1,2\}$), GLMNet employs the graph smoothing convolution
layer (Kipf and Welling, 2017b), the cross-graph convolution layer (Wang et al,
2019g) and the graph sharpening convolution layer (i.e., defined as the counterpart
of Laplacian smoothing in (Kipf and Welling, 2017b)) to further learn and update
their node embeddings, i.e., $\tilde{X}^{(l)}$ ($l = \{1,2\}$). After that, GLMNet directly computes
the node-to-node assignment matrix $S$ by Equations (13.9) and (13.10),
exactly as PCA-GM (Wang et al, 2019g) does.

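In addition to the permutation cross entropy loss $\mathcal{L}_{perm}$ defined in Equation (13.11),
GLMNet adds an extra constraint regularized loss $\mathcal{L}_{con}$ for better satisfying the
permutation constraint, i.e., $\mathcal{L} = \mathcal{L}_{perm} + \lambda \mathcal{L}_{con}$ with
$\lambda > 0$, in which $\mathcal{L}_{con}$ is defined as follows.

$$
\begin{aligned}
\mathcal{L}_{con} &= \sum_{v_i, v_j \in \mathcal{V}^{(1)}} \; \sum_{v_a, v_b \in \mathcal{V}^{(2)}}
U_{ij,ab} \, S_{i,a} S_{j,b} \\
U_{ij,ab} &=
\begin{cases}
1 & \text{if } i = j, a \neq b \text{ or } i \neq j, a = b; \\
0 & \text{otherwise,}
\end{cases}
\end{aligned}
\tag{13.13}
$$

where $U \in \mathbb{R}^{n^2 \times n^2}$ represents the conflict relationships of all matches and the
optimum correspondence $S$ means
$\sum_{v_i, v_j \in \mathcal{V}^{(1)}} \sum_{v_a, v_b \in \mathcal{V}^{(2)}} U_{ij,ab} S_{i,a} S_{j,b} = 0$.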
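The conflict loss of Equation (13.13) can be evaluated without materializing the n² × n² matrix U. Expanding the two conflict cases gives the sum of squared row sums plus the sum of squared column sums of S, minus twice the sum of squared entries. The NumPy sketch below (function names are hypothetical) checks this closed form against a direct four-fold loop.

```python
import numpy as np


def conflict_loss_naive(S):
    """Direct evaluation of Eq. (13.13) with an explicit conflict test."""
    n = S.shape[0]
    total = 0.0
    for i in range(n):
        for j in range(n):
            for a in range(n):
                for b in range(n):
                    conflict = (i == j and a != b) or (i != j and a == b)
                    if conflict:
                        total += S[i, a] * S[j, b]
    return total


def conflict_loss(S):
    """Closed form: one node matched to two nodes, or two nodes matched to one."""
    sq = np.sum(S ** 2)
    row_term = np.sum(S.sum(axis=1) ** 2) - sq    # i = j, a != b
    col_term = np.sum(S.sum(axis=0) ** 2) - sq    # i != j, a = b
    return float(row_term + col_term)


rng = np.random.default_rng(0)
S_random = rng.uniform(size=(3, 3))               # arbitrary soft matching
S_perm = np.eye(3)[[2, 0, 1]]                     # a valid permutation matrix
S_uniform = np.full((3, 3), 1.0 / 3.0)

loss_random_fast = conflict_loss(S_random)
loss_random_naive = conflict_loss_naive(S_random)
loss_perm = conflict_loss(S_perm)
loss_uniform = conflict_loss(S_uniform)
```

The loss vanishes exactly when no row and no column of S carries mass in two different entries, which is the case for any permutation matrix, and it grows as the soft matching spreads its mass.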
vivje y) VasvpEV 2) 


Deep Graph Matching with Consensus. In (Fey et al, 2020), Fey et al also
employ GNNs to learn the graph correspondence as in previous work, but additionally
introduce a neighborhood consensus (Rocco et al, 2018) to further refine the
learned correspondence matrix. Firstly, they use common GNN models along with
the Sinkhorn operation to compute an initial correspondence matrix $S^{(0)}$ as follows,
where $\Psi_{\theta_1}$ denotes the shared GNN model for both graphs.

$$
\begin{aligned}
H^{(l)} &= \Psi_{\theta_1}\left(X^{(l)}, A^{(l)}, E^{(l)}\right), \quad l = \{1,2\} \\
S^{(0)} &= \mathrm{Sinkhorn}\left(H^{(1)} H^{(2)\top}\right)
\end{aligned}
\tag{13.14}
$$

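Then, to reach a neighborhood consensus between the pair of matched nodes,
they refine the initial correspondence matrix $S^{(0)}$ via another trainable GNN model
(i.e., $\Psi_{\theta_2}$) and an MLP model (i.e., $\phi_{\theta_3}$).

$$
\begin{aligned}
O^{(1)} &= \Psi_{\theta_2}\left(I_n, A^{(1)}, E^{(1)}\right) \\
O^{(2)} &= \Psi_{\theta_2}\left(S^{(k)\top} I_n, A^{(2)}, E^{(2)}\right) \\
S^{(k+1)}_{i,a} &= \mathrm{Sinkhorn}\left(S^{(k)}_{i,a} + \phi_{\theta_3}\left(o_i^{(1)} - o_a^{(2)}\right)\right)
\end{aligned}
\tag{13.15}
$$

where $I_n$ is the identity matrix and $o_i^{(1)} - o_a^{(2)}$ is computed as the neighborhood
consensus between the node pair $(v_i, v_a) \in \mathcal{V}^{(1)} \times \mathcal{V}^{(2)}$ between the two
graphs (e.g., $o_i^{(1)} - o_a^{(2)} \neq 0$ means a false matching over the neighborhoods of $v_i$
and $v_a$). Finally, $S^{(K)}$ is obtained after $K$ iterations and the final loss function incorporates
both the feature matching loss and the neighborhood consensus loss, i.e.,
$\mathcal{L} = \mathcal{L}^{init} + \mathcal{L}^{refine}$.

$$
\begin{aligned}
\mathcal{L}^{init} &= - \sum_{v_i \in \mathcal{V}^{(1)}} \log\left(S^{(0)}_{i, \pi_{gt}(i)}\right) \\
\mathcal{L}^{refine} &= - \sum_{v_i \in \mathcal{V}^{(1)}} \log\left(S^{(K)}_{i, \pi_{gt}(i)}\right)
\end{aligned}
\tag{13.16}
$$

where $\pi_{gt}(i)$ denotes the ground truth correspondence.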


Deep Graph Matching with Hungarian Attention. Yu et al (2020) present an
end-to-end deep learning model which is almost identical to that of Wang et al (2019g),
including a graph embedding layer based on GNNs, an affinity learning layer (i.e.,
Equations (13.9) and (13.10)), and the permutation loss (i.e., Equation (13.11)).
However, they improve the model in two main aspects. The first
is adopting a novel node/edge embedding operation (i.e., CIE) to replace the
commonly used GCN operation, which simply updates node embeddings while ignoring
the rich edge attributes. Since the edge information plays a crucial role
in determining the graph matching result, CIE updates both node and edge embeddings
simultaneously by a channel-wise updating function in a multi-head fashion.
Interested readers are referred to Section 3.2 in (Yu et al, 2020). The second is a
novel loss function. As the previously used permutation loss is prone to overfitting,
the authors devise a novel loss function that introduces a Hungarian attention $Z$ into
the permutation loss as follows.

$$
\begin{aligned}
Z &= \mathrm{Attention}\left(\mathrm{Hungarian}(S), S^{gt}\right) \\
\mathcal{L}_{hung} &= - \sum_{v_i \in \mathcal{V}^{(1)},\, v_a \in \mathcal{V}^{(2)}}
Z_{i,a} \left( S^{gt}_{i,a} \log(S_{i,a}) + (1 - S^{gt}_{i,a}) \log(1 - S_{i,a}) \right)
\end{aligned}
\tag{13.17}
$$

where Hungarian denotes a black-box Hungarian algorithm and the role of $Z$ is like
a mask that attempts to focus more on those mismatched node pairs and focus less
on node pairs that are matched exactly.
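The masking effect of Equation (13.17) can be illustrated with a small NumPy sketch. Two assumptions are made here and should be read as such: the Attention operator is taken to be the element-wise union of the discrete Hungarian output and the ground truth, which is one plausible reading that the text above does not spell out, and a brute-force search over permutations stands in for a real Hungarian solver, which is only viable for tiny matrices.

```python
import itertools
import numpy as np


def best_assignment(S):
    """Brute-force linear assignment, a stand-in for the Hungarian algorithm."""
    n = S.shape[0]
    best = max(itertools.permutations(range(n)),
               key=lambda p: sum(S[i, p[i]] for i in range(n)))
    P = np.zeros((n, n))
    P[np.arange(n), list(best)] = 1.0
    return P


def hungarian_attention(S, S_gt):
    """Assumed attention: element-wise union of the discrete prediction and the ground truth."""
    return np.maximum(best_assignment(S), S_gt)


def masked_cross_entropy(S, S_gt, Z, eps=1e-12):
    """Permutation loss of Eq. (13.11) with every term weighted by the mask Z."""
    S = np.clip(S, eps, 1.0 - eps)
    ce = S_gt * np.log(S) + (1.0 - S_gt) * np.log(1.0 - S)
    return float(-np.sum(Z * ce))


S_gt = np.eye(3)
S = np.array([[0.2, 0.7, 0.1],      # nodes 0 and 1 are predicted in swapped order
              [0.6, 0.3, 0.1],
              [0.1, 0.1, 0.8]])     # node 2 is predicted correctly

Z = hungarian_attention(S, S_gt)
loss_hung = masked_cross_entropy(S, S_gt, Z)
loss_perm = masked_cross_entropy(S, S_gt, np.ones_like(S))
```

In this toy case the mask keeps the entries of the two confused nodes and the single correct entry of the third node, while all entries that are zero in both the prediction and the ground truth are dropped from the loss.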


Graph Matching with Assignment Graph. Differently, Wang et al (2020j) reformulate
the graph matching problem as the problem of selecting reliable nodes in
the constructed assignment graph (Cho et al, 2010), in which each node represents a
potential node-to-node correspondence. The formal definition of the assignment graph
is given in Definition 13.2 and one example is illustrated in Fig. 13.3.


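Definition 13.2 (Assignment Graph). Given two graphs
$\mathcal{G}^{(1)} = (\mathcal{V}^{(1)}, \mathcal{E}^{(1)}, X^{(1)}, E^{(1)})$ and
$\mathcal{G}^{(2)} = (\mathcal{V}^{(2)}, \mathcal{E}^{(2)}, X^{(2)}, E^{(2)})$, an assignment graph
$\mathcal{G}^{(A)} = (\mathcal{V}^{(A)}, \mathcal{E}^{(A)}, X^{(A)}, E^{(A)})$ is constructed as follows.
$\mathcal{G}^{(A)}$ takes each candidate correspondence
$(v_i^{(1)}, v_a^{(2)}) \in \mathcal{V}^{(1)} \times \mathcal{V}^{(2)}$ between the two graphs as a node
$v_{ia}^{(A)} \in \mathcal{V}^{(A)}$ and links an edge between a pair of nodes
$v_{ia}^{(A)}, v_{jb}^{(A)} \in \mathcal{V}^{(A)}$ (i.e., $(v_{ia}^{(A)}, v_{jb}^{(A)}) \in \mathcal{E}^{(A)}$)
if and only if both edges, i.e., $(v_i^{(1)}, v_j^{(1)}) \in \mathcal{E}^{(1)}$ and
$(v_a^{(2)}, v_b^{(2)}) \in \mathcal{E}^{(2)}$, exist in their original graphs. Optionally, for
node attributes $X^{(A)}$ and edge attributes $E^{(A)}$, each of them could be obtained by
concatenating attributes of the pair of nodes or edges in the original graphs, respectively.


Fig. 13.3: Example illustration of building an assignment graph $\mathcal{G}^{(A)}$ from the pair
of graphs $\mathcal{G}^{(1)}$ and $\mathcal{G}^{(2)}$.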


With the constructed assignment graph $\mathcal{G}^{(A)}$, the reformulated problem of selecting
reliable nodes in $\mathcal{G}^{(A)}$ is quite similar to binary node classification tasks (Kipf and
Welling, 2017b) that classify nodes as positive or negative (i.e., matched
or un-matched). To solve the problem, the authors propose a fully learnable model
based on GNNs which takes $\mathcal{G}^{(A)}$ as input, iteratively learns node embeddings
over graph structural information and predicts a label for each node in $\mathcal{G}^{(A)}$ as
output. Besides, the model is trained with a loss function similar to that of (Jiang et al, 2019a).
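The construction of the assignment graph can be sketched with a Kronecker product of the two adjacency matrices. In the NumPy example below, the node index i·n₂ + a is an assumed layout for the candidate correspondence between v_i and v_a, the function name is hypothetical, and the toy node attributes are arbitrary numbers used only to show the concatenation.

```python
import numpy as np


def assignment_graph(A1, A2, X1, X2):
    """Build the adjacency and node attributes of the assignment graph.

    Assumed layout: node i * n2 + a represents the candidate match (v_i, v_a).
    Two such nodes are linked iff (v_i, v_j) is an edge of G1 and (v_a, v_b) is an edge of G2.
    """
    n1, n2 = A1.shape[0], A2.shape[0]
    A_assign = np.kron(A1, A2)                  # entry [(i, a), (j, b)] = A1[i, j] * A2[a, b]
    X_assign = np.concatenate(
        [np.repeat(X1, n2, axis=0),             # row i * n2 + a carries X1[i] ...
         np.tile(X2, (n1, 1))],                 # ... concatenated with X2[a]
        axis=1,
    )
    return A_assign, X_assign


A1 = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]])     # path graph
A2 = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]])     # triangle
X1 = np.arange(3, dtype=float)[:, None]              # toy node attributes 0, 1, 2
X2 = 10.0 * np.arange(3, dtype=float)[:, None]       # toy node attributes 0, 10, 20
A_assign, X_assign = assignment_graph(A1, A2, X1, X2)
```

A GNN-based node classifier run on `A_assign` and `X_assign` would then output one score per candidate correspondence, and the selected nodes can be read back as entries of the matching matrix by inverting the index layout.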


13.3 Graph Similarity Learning 


In this section, we will first introduce the second category of the general graph 
matching problem — the graph similarity problem. Then, we will provide an ex- 
tensive discussion and analysis of state-of-the-art graph similarity learning models 
based on GNNs. 


13.3.1 Problem Definition 


Learning a similarity metric between an arbitrary pair of graph-structured objects
is one of the fundamental problems in a variety of applications, ranging from similar
graph searching in databases (Yan and Han, 2002), to binary function analysis
(Li et al, 2019h), unknown malware detection (Wang et al, 2019i), semantic code
retrieval (Ling et al, 2021), etc. According to different application backgrounds,
the similarity metric can be defined by different measures of structural similarity,
such as graph edit distance (GED) (Riesen, 2015), maximum common subgraph
(MCS) (Bunke, 1997; Bai et al, 2020c), or even a more coarse binary similarity (i.e.,
similar or not) (Ling et al, 2021). As GED is equivalent to the problem of MCS
under a fitness function (Bunke, 1997), in this section, we mainly consider the GED
computation and focus more on state-of-the-art graph similarity learning models
based on GNNs.

Basically, the graph similarity problem intends to compute a similarity score be- 
tween a pair of graphs, which indicates how similar the pair of graphs is. In the 
following Definition 13.3, the general graph similarity problem is defined. 


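Definition 13.3 (Graph Similarity Problem). Given two input graphs $\mathcal{G}^{(1)}$ and
$\mathcal{G}^{(2)}$, the purpose of the graph similarity problem is to produce a similarity score $s$
between $\mathcal{G}^{(1)}$ and $\mathcal{G}^{(2)}$. In line with the notations in Section 13.2.1,
$\mathcal{G}^{(1)} = (\mathcal{V}^{(1)}, \mathcal{E}^{(1)}, A^{(1)}, X^{(1)})$ is represented as a set of $n$
nodes $v_i \in \mathcal{V}^{(1)}$ with a feature matrix $X^{(1)} \in \mathbb{R}^{n \times d}$, and edges
$(v_i, v_j) \in \mathcal{E}^{(1)}$ forming an adjacency matrix $A^{(1)}$. Similarly,
$\mathcal{G}^{(2)} = (\mathcal{V}^{(2)}, \mathcal{E}^{(2)}, A^{(2)}, X^{(2)})$ is represented as a set of $m$
nodes $v_a \in \mathcal{V}^{(2)}$ with a feature matrix $X^{(2)} \in \mathbb{R}^{m \times d}$, and edges
$(v_a, v_b) \in \mathcal{E}^{(2)}$ forming an adjacency matrix $A^{(2)}$.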


For the similarity score $s$, if $s \in \mathbb{R}$, the graph similarity problem can be considered
as a graph-graph regression task. On the other hand, if $s \in \{-1, 1\}$, the problem
can be considered as a graph-graph classification task.

Particularly, the computation of GED (Riesen, 2015; Bai et al, 2019b) (sometimes
normalized in [0, 1]) is a typical case of graph-graph regression tasks. To be
specific, GED is formulated as the cost of the shortest sequence of edit operations
over nodes or edges that have to be undertaken to transform one graph into another
graph, in which an edit operation can be an insertion or a deletion of a node or an
edge. In Fig. 13.4, we give an illustration of GED computation.
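For very small graphs, GED can be computed exactly by enumerating node mappings, which also makes the source of the combinatorial cost visible. The sketch below is written under simplifying assumptions that are not part of the definition above: graphs are undirected and unlabeled, every insertion or deletion of a node or an edge costs 1, and the smaller graph is padded with isolated dummy nodes. The function name and the toy graphs (a triangle and a two-edge path) are chosen for illustration.

```python
import itertools
import numpy as np


def ged_brute_force(A1, A2):
    """Exact unit-cost GED for tiny undirected, unlabeled graphs."""
    n, m = A1.shape[0], A2.shape[0]
    N = max(n, m)
    P1 = np.zeros((N, N), dtype=int)
    P2 = np.zeros((N, N), dtype=int)
    P1[:n, :n] = A1                          # pad the smaller graph with isolated nodes
    P2[:m, :m] = A2
    node_cost = abs(n - m)                   # nodes that must be inserted or deleted
    best = None
    for perm in itertools.permutations(range(N)):
        idx = np.array(perm)
        # Each mismatched undirected edge shows up twice in the symmetric difference.
        edge_cost = int(np.abs(P1 - P2[np.ix_(idx, idx)]).sum()) // 2
        cost = node_cost + edge_cost
        if best is None or cost < best:
            best = cost
    return best


triangle = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]])
path = np.array([[0, 1, 1], [1, 0, 0], [1, 0, 0]])   # the triangle with one edge removed
single_edge = np.array([[0, 1], [1, 0]])

ged_triangle_path = ged_brute_force(triangle, path)
ged_triangle_self = ged_brute_force(triangle, triangle)
ged_triangle_edge = ged_brute_force(triangle, single_edge)
```

The loop visits N! mappings, so this approach stops being practical after roughly ten nodes, which is the gap that heuristic and learning-based estimators aim to close.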

Similar to the classic graph matching problem, the computation of GED is also
a well-studied NP-hard problem. Although there is a rich body of work (Hart et al,
1968; Zeng et al, 2009; Riesen et al, 2007) that attempts to find sub-optimal solutions
in polynomial time via a variety of heuristics (Riesen et al, 2007; Riesen,
2015), these heuristic methods still suffer from poor scalability (e.g., large search
space or excessive memory) and heavy reliance on expert knowledge (e.g., various
heuristics based on different application cases). Currently, learning-based models
which incorporate GNNs into an end-to-end learning framework for graph similarity
learning are gradually becoming more and more available, demonstrating their
superiority over traditional heuristic methods in both effectiveness and efficiency. In
the following two subsections, we will discuss state-of-the-art GNN-based graph similarity
models for graph-graph regression tasks and graph-graph classification tasks,
respectively.


Fig. 13.4 Illustration of computing the GED score between $\mathcal{G}^{(1)}$ and $\mathcal{G}^{(2)}$.
Since $\mathcal{G}^{(1)}$ can be transformed into $\mathcal{G}^{(2)}$ by deleting the edge $(v_2, v_3)$, or
$\mathcal{G}^{(2)}$ can be transformed into $\mathcal{G}^{(1)}$ by inserting the edge $(v_b, v_c)$, the GED
between the two graphs is 1.


13.3.2 Graph-Graph Regression Tasks 


As mentioned above, the graph-graph regression task refers to computing a similar- 
ity score between a pair of graphs and we focus on the graph similarity learning on 
GED in this subsection. 


Graph Similarity Learning with Convolutional Set Matching. Aiming to accelerate
graph similarity computation while preserving good performance, Bai et al (2018)
first turn the computation of GED into a learning problem, rather than relying on
approximation methods with combinatorial search, and then propose an end-to-end
framework, namely GSimCNN, for graph similarity learning. GSimCNN (Bai et al,
2018) (or GraphSim in (Bai et al, 2020b)⁴) is probably the first work that applies
both GNNs and CNNs to the task of GED computation, and it consists of three steps
in general. First, GSimCNN employs multiple layers of standard GCNs to generate
the node embedding vector for each node in the pair of graphs. Second, in each layer
of GCNs, GSimCNN uses the BFS node-ordering scheme (You et al, 2018b) to
re-order the node embeddings and computes the inner product between the re-ordered
node embeddings of the two graphs to generate a node-to-node similarity matrix.
Finally, after padding or resizing the resulting node-to-node similarity matrices
into square matrices, the authors transform the task of graph similarity computation
into an image processing problem and apply standard CNNs and MLPs for the final
graph similarity prediction. GSimCNN is trained with a mean squared error loss
function based on the predicted similarity scores and the corresponding ground-truth
scores.
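
The second and third steps can be sketched as follows. This is an illustrative
fragment only: it assumes the node embeddings have already been produced by the GCN
layers and re-ordered by BFS, and the helper names `similarity_matrix` and
`pad_to_square` are our own rather than part of GSimCNN. Zero-padding is used here;
the resizing alternative mentioned above is not shown.

```python
def similarity_matrix(emb1, emb2):
    """Inner products between every node of graph 1 and every node of graph 2."""
    return [[sum(a * b for a, b in zip(h1, h2)) for h2 in emb2] for h1 in emb1]


def pad_to_square(matrix, fill=0.0):
    """Zero-pad an n x m matrix to max(n, m) x max(n, m)."""
    n = len(matrix)
    m = len(matrix[0]) if matrix else 0
    size = max(n, m)
    padded = [row + [fill] * (size - m) for row in matrix]
    padded += [[fill] * size for _ in range(size - n)]
    return padded


# Hypothetical 2-d node embeddings, assumed to be already BFS-ordered.
emb_g1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 nodes
emb_g2 = [[1.0, 0.0], [0.5, 0.5]]              # 2 nodes
S = pad_to_square(similarity_matrix(emb_g1, emb_g2))
for row in S:
    print(row)
```

The resulting square matrix plays the role of a single-channel image that the CNN
stage consumes.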


4 It seems that the model architecture of GSimCNN in (Bai et al, 2018) is the same as that of
GraphSim in (Bai et al, 2020b), which evaluates the model with additional datasets and similarity
metrics (i.e., both GED and MCS).


Graph Similarity Learning with Graph-Level Interaction. Soon after, Bai et al (2019b)
present another GNN-based model, called SimGNN, for graph similarity learning.
SimGNN considers not only node-level interactions but also graph-level interactions
to jointly learn the graph similarity score. For the node-level similarity between
two graphs, it first adopts an approach similar to GSimCNN to generate the
node-to-node similarity matrix, and then extracts a histogram feature vector from
the matrix as the node-level comparison information. For the graph-level similarity
between two graphs, SimGNN first employs a simple graph pooling model via an
attention mechanism to generate one graph-level embedding vector for each graph
(i.e., $h_{\mathcal{G}^{(1)}}$ and $h_{\mathcal{G}^{(2)}}$) and then adopts a trainable neural
tensor network (NTN) (Socher et al, 2013) to model the relationship between the two
graph-level embedding vectors as follows.

$$\mathrm{NTN}(h_{\mathcal{G}^{(1)}}, h_{\mathcal{G}^{(2)}}) = \sigma\left( h_{\mathcal{G}^{(1)}}^{\top} W^{[1:K]} h_{\mathcal{G}^{(2)}} + V \begin{bmatrix} h_{\mathcal{G}^{(1)}} \\ h_{\mathcal{G}^{(2)}} \end{bmatrix} + b \right) \tag{13.18}$$

where $\sigma$ is the activation function and $[\cdot]$ denotes the concatenation operation.


In addition, $W^{[1:K]}$, $V$ and $b$ are parameters in NTN to be learned and $K$ is a
hyper-parameter which determines the length of the graph-level similarity vector
calculated by NTN. Finally, to compute the similarity score between two graphs,
SimGNN concatenates the two similarity vectors from the node level and the graph
level and feeds them into a small MLP network for prediction.
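
To make the shapes in Equation (13.18) concrete, the following sketch evaluates an NTN
on plain Python lists. It assumes a sigmoid activation and stores the tensor as $K$
slices of size $d \times d$, with $V$ of size $K \times 2d$ and $b$ of length $K$; the
function name `ntn` and the toy parameters are illustrative, and a real model would
learn these parameters with a deep learning framework.

```python
import math


def ntn(h1, h2, W, V, b):
    """Neural tensor network output of length K.

    W: K slices, each a d x d matrix; V: K x 2d matrix; b: length-K bias.
    A sigmoid activation is assumed.
    """
    concat = h1 + h2  # concatenation [h1; h2]
    scores = []
    for W_k, V_k, b_k in zip(W, V, b):
        bilinear = sum(h1[i] * W_k[i][j] * h2[j]
                       for i in range(len(h1)) for j in range(len(h2)))
        linear = sum(v * x for v, x in zip(V_k, concat))
        scores.append(1.0 / (1.0 + math.exp(-(bilinear + linear + b_k))))
    return scores


# Toy parameters with d = 2 and K = 2 (hypothetical values).
W = [[[1.0, 0.0], [0.0, 1.0]], [[0.0, 0.0], [0.0, 0.0]]]
V = [[0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
b = [0.0, 0.0]
out = ntn([1.0, 0.0], [1.0, 0.0], W, V, b)
print(out)
```

Each of the $K$ slices contributes one entry of the graph-level similarity vector, which
is why $K$ controls its length.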


Graph Similarity Learning based on Hierarchical Clustering. Xiu et al (2020) argue
that if two graphs are similar, their corresponding compact graphs should be similar
to each other and, conversely, if two graphs are dissimilar, their corresponding
compact graphs should also be dissimilar. They believe that, for the input pair of
graphs, different views given by different pairs of compact graphs can provide
different scales of similarity information between the two input graphs and thus
benefit the graph similarity computation. To this end, a hierarchical graph matching
network (HGMN) (Xiu et al, 2020) is presented to learn the graph similarity from a
multi-scale view. Concretely, HGMN first employs multiple stages of hierarchical
graph clustering to successively generate more compact graphs with initial node
embeddings, providing a multi-scale view of the differences between the two graphs
for subsequent model learning. Then, with the pairs of compact graphs in different
stages, HGMN computes the final graph similarity score by adopting a GraphSim-like
model (Bai et al, 2020b), including node embedding updates via GCNs, similarity
matrix generation and prediction via CNNs. However, in order to ensure permutation
invariance of the generated similarity matrices, HGMN devises a different
node-ordering scheme based on the earth mover's distance (EMD) (Rubner et al, 1998)
rather than the BFS node-ordering method in (Bai et al, 2020b). According to the
EMD, HGMN first aligns the nodes of both input graphs in each stage and then
produces the corresponding similarity matrix in the aligned order.


Graph Similarity Learning with Node-Graph Interaction. To learn richer interaction
features between a pair of input graphs for computing the graph similarity in an
end-to-end fashion, Ling et al propose a multi-level graph matching network (MGMN)
(Ling et al, 2020) which consists of a siamese graph neural network (SGNN) and a
novel node-graph matching network (NGMN). To learn graph-level interactions between
two graphs, SGNN first utilizes multi-layer GCNs in a siamese network to generate
node embeddings $H^{(l)} = \{h_i^{(l)}\}_{i=1}^{\{n,m\}} \in \mathbb{R}^{\{n,m\} \times d}$ for
all nodes in graph $\mathcal{G}^{(l)}$, $l = \{1, 2\}$, and then aggregates a corresponding
graph-level embedding vector for each graph. On the other hand, to learn cross-level
interaction features between two graphs, NGMN further employs a node-graph matching
layer to update node embeddings with learned cross-level interactions between the
node embeddings of one graph and the graph-level embedding of the other whole graph.
Taking a node $v_i \in \mathcal{V}^{(1)}$ in $\mathcal{G}^{(1)}$ as an example, NGMN first
computes an attentive graph-level embedding vector $h_{\mathcal{G}^{(2)}}^{i,att}$ for
$\mathcal{G}^{(2)}$ by weighted averaging all node embeddings in $\mathcal{G}^{(2)}$ based on
the corresponding cross-graph attention coefficient towards $v_i$ as follows.

$$h_{\mathcal{G}^{(2)}}^{i,att} = \sum_{v_j \in \mathcal{V}^{(2)}} \alpha_{i,j} h_j^{(2)}, \quad \text{where } \alpha_{i,j} = \mathrm{cosine}(h_i^{(1)}, h_j^{(2)}) \;\; \forall v_j \in \mathcal{V}^{(2)} \tag{13.19}$$

where att in the superscript of $h_{\mathcal{G}^{(2)}}^{i,att}$ means it is an attentive
graph-level embedding vector of $\mathcal{G}^{(2)}$ in terms of the node $v_i$ in
$\mathcal{G}^{(1)}$.

Then, to update the node embedding of $v_i$ with cross-graph interactions, NGMN
learns a similarity feature vector between the node embedding (i.e., $h_i^{(1)}$) and the
attentive graph-level embedding vector (i.e., $h_{\mathcal{G}^{(2)}}^{i,att}$) via a
multi-perspective matching function. After performing the above node-graph matching
layer over all nodes of both graphs, NGMN aggregates a corresponding graph-level
embedding vector for each graph. The full model MGMN concatenates the two
aggregated graph-level embeddings from SGNN and NGMN for each graph and feeds
those concatenated embeddings into a final small prediction network for the graph
similarity computation.
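
As a small illustration of Equation (13.19), the sketch below computes the attentive
graph-level embedding of one graph with respect to a single node of the other graph.
It follows the equation as written, i.e., a cosine-weighted sum without further
normalization of the weights; the function names and the toy embeddings are our own,
and the multi-perspective matching function that follows in NGMN is not shown.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either vector is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def attentive_graph_embedding(h_i, other_nodes):
    """Sum of the other graph's node embeddings, weighted by cosine similarity to h_i."""
    out = [0.0] * len(h_i)
    for h_j in other_nodes:
        alpha = cosine(h_i, h_j)  # cross-graph attention coefficient
        for k in range(len(out)):
            out[k] += alpha * h_j[k]
    return out


# Hypothetical embeddings: one node of graph 1, two nodes of graph 2.
h_att = attentive_graph_embedding([1.0, 0.0], [[2.0, 0.0], [0.0, 3.0]])
print(h_att)  # [2.0, 0.0]: only the aligned node contributes
```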


Graph Similarity Learning based on GRAPH-BERT. As previous studies on graph
similarity learning are mostly trained in a supervised manner and cannot guarantee
the basic properties (e.g., the triangle inequality) of a graph similarity metric
like GED, Zhang introduces a novel training framework called GB-DISTANCE (Zhang,
2020) based on GRAPH-BERT (Zhang et al, 2020a). First, GB-DISTANCE adapts
the pre-trained GRAPH-BERT model to update node embeddings and further aggregates
a graph-level embedding vector $h_{\mathcal{G}^{(i)}}$ for the graph $\mathcal{G}^{(i)}$.
Then, GB-DISTANCE computes the graph similarity $d_{i,j}$ between the pair of graphs
$(\mathcal{G}^{(i)}, \mathcal{G}^{(j)})$ with several fully connected layers as follows.

$$d(\mathcal{G}^{(i)}, \mathcal{G}^{(j)}) = 1 - \exp\left( -\mathrm{FC}\left( (h_{\mathcal{G}^{(i)}} - h_{\mathcal{G}^{(j)}})^{**2} \right) \right) \tag{13.20}$$

where FC denotes the employed fully connected layers and $(\cdot)^{**2}$ denotes the
element-wise square of the input vector. In (Zhang, 2020), GB-DISTANCE considers
a scenario that inputs a set of $m$ graphs (i.e., $\{\mathcal{G}^{(i)}\}_{i=1}^{m}$) and outputs
the similarity between any pair of graphs, i.e., a similarity matrix
$D = \{D_{i,j}\}_{i,j=1}^{m} = \{d(\mathcal{G}^{(i)}, \mathcal{G}^{(j)})\}_{i,j=1}^{m} \in \mathbb{R}^{m \times m}$,
and formulates the graph similarity problem in a supervised or semi-supervised
setting as follows.

$$\begin{aligned}
\min \;\; & \| M \odot (D - \hat{D}) \|_p, \quad \text{with } M_{i,j} = \begin{cases} 1 & \text{if } D_{i,j} \text{ is labeled} \\ \alpha & \text{if } D_{i,j} \text{ is unlabeled} \wedge i \neq j \\ \beta & \text{if } i = j \end{cases} \\
\text{s.t.} \;\; & D_{i,j} \leq D_{i,k} + D_{k,j}, \quad \forall i, j, k \in \{1, \cdots, m\}
\end{aligned} \tag{13.21}$$

where $\| \cdot \|_p$ denotes the $L_p$ norm; $\hat{D}$ denotes the ground-truth similarity
matrix; $M$ is a mask matrix for the semi-supervised learning with two hyper-parameters
$\alpha$ and $\beta$; and the constraint $D_{i,j} \leq D_{i,k} + D_{k,j}$,
$\forall i, j, k \in \{1, \cdots, m\}$, tries to ensure the triangle inequality of the graph
similarity metric. To optimize the model with such constraints, GB-DISTANCE devises
a two-phase training algorithm with constrained metric refining methods.
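
The need for the explicit constraint in Equation (13.21) can be seen from a small
numerical sketch of Equation (13.20). Here FC is replaced by a single linear layer
with non-negative weights, which is purely an assumption made so that the example
stays self-contained; the names `gb_distance` and `triangle_violations` are
hypothetical and do not come from (Zhang, 2020).

```python
import math
from itertools import permutations


def gb_distance(h_i, h_j, weights, bias=0.0):
    """d = 1 - exp(-FC((h_i - h_j)**2)), with FC assumed to be one linear layer.

    Non-negative weights and bias keep FC(.) >= 0, hence d lies in [0, 1).
    """
    squared = [(a - b) ** 2 for a, b in zip(h_i, h_j)]
    fc = sum(w * s for w, s in zip(weights, squared)) + bias
    return 1.0 - math.exp(-fc)


def triangle_violations(D, tol=1e-12):
    """All index triples (i, j, k) with D[i][j] > D[i][k] + D[k][j]."""
    m = len(D)
    return [(i, j, k) for i, j, k in permutations(range(m), 3)
            if D[i][j] > D[i][k] + D[k][j] + tol]


# Three hypothetical 1-d graph embeddings lying on a line.
embeddings = [[0.0], [0.1], [0.2]]
D = [[gb_distance(a, b, weights=[1.0]) for b in embeddings] for a in embeddings]
violations = triangle_violations(D)
print(violations)  # contains (0, 2, 1): going via graph 1 is "shorter" than the direct distance
```

Because the distance is built from squared differences, the unconstrained form can
violate the triangle inequality even on such a simple example, which motivates
treating the inequality as a constraint during training.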


Graph Similarity Computation based on A*. All of the aforementioned approaches
directly compute the GED similarity score between two graphs but fail to produce
the edit path, which explicitly expresses the sequence of edit operations for
transforming one graph into the other. To output the edit path like the traditional
A* algorithm (Hart et al, 1968; Riesen et al, 2007), Wang et al propose a graph
similarity learning model GENN-A* (Wang et al, 2020f) which incorporates the
existing A* solution with a learnable GENN model based on GNNs. A* (Hart et al,
1968; Riesen et al, 2007) is a tree-searching algorithm which explores the space of
all possible node/edge mappings between two graphs as an ordered search tree and
expands the successors of a node $p$ in the search tree by the minimum induced edit
cost $g(p) + h(p)$, in which $g(p)$ is the cost of the current partial edit path
induced so far and $h(p)$ is the estimated cost of the edit path between the
remaining un-matched sub-graphs. Because of the poor scalability of A*, GENN-A*
replaces the heuristics with a learning-based model (i.e., GENN) to predict $h(p)$.
GENN is almost the same as SimGNN (Bai et al, 2019b) with the histogram module
removed and is used to predict a normalized GED score $s(p) \in (0, 1)$ between the
remaining un-matched sub-graphs. After that, $h(p)$ is obtained as follows, where
$n'$ and $m'$ denote the numbers of nodes of the un-matched sub-graphs.

$$h(p) = -0.5\,(n' + m') \log(s(p)) \tag{13.22}$$
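
Equation (13.22) and the resulting A* priority can be written down directly.
Algebraically, the equation is the inverse of $s(p) = \exp(-2h(p)/(n' + m'))$, so a
predicted score close to 1 yields a small estimated remaining cost. In the sketch
below the similarity score is simply passed in as a number, standing in for the GENN
prediction, and the function names are illustrative.

```python
import math


def ged_heuristic(similarity, n_rem, m_rem):
    """h(p) = -0.5 * (n' + m') * log(s(p)) for a score s(p) in (0, 1)."""
    if not 0.0 < similarity < 1.0:
        raise ValueError("similarity must lie in (0, 1)")
    return -0.5 * (n_rem + m_rem) * math.log(similarity)


def priority(g_cost, similarity, n_rem, m_rem):
    """A* priority g(p) + h(p) used to choose the next search-tree node to expand."""
    return g_cost + ged_heuristic(similarity, n_rem, m_rem)


# A partial edit path of cost 3 with two un-matched nodes left on each side.
print(priority(3.0, 0.9, 2, 2))
print(priority(3.0, 0.5, 2, 2))  # a lower predicted similarity gives a larger priority value
```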


13.3.3 Graph-Graph Classification Tasks 


In addition to the computation of GED, learning a binary label $s \in \{-1, 1\}$ (i.e.,
similar or not) between a pair of graphs can be viewed as a task of graph-graph
classification learning⁵ and has been widely studied in many real-world applications,
including binary code analysis, source code analysis, malware detection, etc.


5 The term graph-graph classification learning is totally different from the general graph
classification task (Ying et al, 2018c; Ma et al, 2019d) that only predicts a label for one input
graph rather than a pair of input graphs.


Graph Similarity Learning via Cross-graph Matching. In the scenario of detecting
whether two binary functions are similar or not, Li et al present a message-passing
based graph matching network (GMN) (Li et al, 2019h) to learn a similarity label
between the two control-flow graphs (CFGs) which represent the two input binary
functions. In particular, GMN employs a similar cross-graph matching network based
on standard message-passing GNNs to iteratively generate more discriminative node
embeddings (e.g., $H^{(l)} = \{h_i^{(l)}\}_{i=1}^{\{n,m\}}$, $l = \{1, 2\}$) for the two input
graphs. Intuitively, it updates the node embeddings of one input graph by
incorporating the attentive association information of the other through soft
attention, which is similar to the cross-graph convolution network introduced in
Equation (13.8) and Fig. 13.2. Subsequently, in order to calculate the similarity
score, GMN adopts an aggregation operation (Li et al, 2016b) as follows to output a
graph-level embedding vector (i.e., $h_{\mathcal{G}^{(l)}}$, $l = \{1, 2\}$) for each graph
and applies an existing similarity function for the final similarity prediction, i.e.,
$s(h_{\mathcal{G}^{(1)}}, h_{\mathcal{G}^{(2)}}) = f_s(h_{\mathcal{G}^{(1)}}, h_{\mathcal{G}^{(2)}})$,
where $f_s$ can be an arbitrary existing similarity function such as the Euclidean,
cosine or Hamming similarity function.

$$h_{\mathcal{G}^{(l)}} = \mathrm{MLP}_{\theta_1}\left( \sum_{v_i \in \mathcal{V}^{(l)}} \sigma\left(\mathrm{MLP}_{\theta_2}(h_i^{(l)})\right) \odot \mathrm{MLP}_{\theta_3}(h_i^{(l)}) \right), \quad l = \{1, 2\} \tag{13.23}$$

where $\sigma$ denotes the activation function; $\odot$ denotes the element-wise
multiplication operation; and $\mathrm{MLP}_{\theta_1}$, $\mathrm{MLP}_{\theta_2}$,
$\mathrm{MLP}_{\theta_3}$ are MLP networks to be trained. Based on the different
supervision of training samples (e.g., the ground-truth binary label between two
graphs or the relative similarity among three graphs), GMN adopts two margin-based
loss functions, i.e., the pair loss function and the triplet loss function. The
formulation of each loss function differs considerably depending on the similarity
function $f_s$ employed. We therefore refer interested readers to (Li et al, 2019h)
for the loss functions.
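
The gated aggregation in Equation (13.23) can be sketched in a few lines. The three
MLPs are passed in as plain callables, and the usage example substitutes identity
functions for them, which is an assumption made only to keep the arithmetic easy to
follow; a sigmoid is assumed for the activation $\sigma$, and the function name
`gated_graph_embedding` is our own.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_graph_embedding(node_embs, mlp1, mlp2, mlp3):
    """Graph-level embedding: mlp1(sum_i sigmoid(mlp2(h_i)) * mlp3(h_i)), element-wise."""
    dim = len(mlp3(node_embs[0]))
    total = [0.0] * dim
    for h in node_embs:
        gate = [sigmoid(x) for x in mlp2(h)]  # soft per-dimension gate
        value = mlp3(h)
        for k in range(dim):
            total[k] += gate[k] * value[k]
    return mlp1(total)


def identity(h):
    """Placeholder standing in for a trained MLP."""
    return list(h)


h_graph = gated_graph_embedding([[0.0, 0.0], [0.0, 2.0]], identity, identity, identity)
print(h_graph)
```

The gate lets the model decide, per dimension, how much each node contributes to the
graph-level vector before the similarity function $f_s$ is applied.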


Graph Similarity Learning on Heterogeneous Graphs. Motivated by ever-growing
malware threats, a heterogeneous graph matching network (MatchGNet) framework
(Wang et al, 2019i) is proposed for unknown malware detection. To better represent
programs (e.g., benign or malicious) in enterprise systems and capture the
interaction relationships between system entities (e.g., files, processes, sockets,
etc.), a heterogeneous invariant graph is constructed for each program. The malware
detection problem is therefore equivalent to detecting whether two representation
graphs (i.e., the graph of the input program and the graph of the existing benign
program) are similar or not. Due to the heterogeneity of the invariant graph,
MatchGNet employs a hierarchical attention graph neural encoder (HAGNE)-based GNN
to learn a graph-level embedding vector for each program. In particular, HAGNE first
identifies path-relevant sets of neighbors via meta-paths (Sun et al, 2011) and then
updates node embeddings by aggregating the entities under each path-relevant
neighbor set. The graph-level embedding over all the meta-paths is computed by a
weighted summation of the embeddings of all meta-paths. Finally, MatchGNet directly
calculates the cosine similarity between the two graph-level embedding vectors as
the final predicted label for malware detection.


13.4 Summary


In this chapter, we have introduced the general graph matching learning, whereby
objective functions are formulated for establishing an optimal node-to-node
correspondence matrix between two graphs for the classic graph matching problem and
for computing a similarity metric between two graphs for the graph similarity
problem, respectively. In particular, we have thoroughly analyzed and discussed
state-of-the-art GNN-based graph matching models and graph similarity models. For
better graph matching learning in the future, we believe the following directions
require more effort:


Fine-grained cross-graph features. For the graph matching problem, which takes
a pair of graphs as input, interaction features between the two graphs are
fundamental and key features in both graph matching learning and graph similarity
learning. Although several existing methods (Li et al, 2019h; Ling et al,
2020) have been devoted to learning interaction features between two graphs for
better representation learning, these models incur non-negligible extra
computational overhead. Better fine-grained cross-graph feature learning with
efficient algorithms could set a new state of the art.

Semi-supervised learning and un-supervised learning. Because of the com- 
plexity of graphs in the real-world application scenarios, it is common to train 
the model in a semi-supervised setting or even in an un-supervised setting. Mak- 
ing full use of relationships between existing graphs and, if possible, the other 
data that is not directly relevant to the graph matching problem could further 
promote the development of graph matching/similarity learning in more practi- 
cal applications. 

Vulnerability and robustness. Although adversarial attacks have been extensively
studied for image classification tasks (Goodfellow et al, 2015; Ling et al,
2019) and node/graph classification tasks (Zügner et al, 2018; Dai et al, 2018a),
there is currently only one preliminary work (Zhang et al, 2020f) that studies
adversarial attacks on the graph matching problem. Therefore, studying the
vulnerability of the state-of-the-art graph matching/similarity models and further
building more robust models is a highly challenging problem.


Editor’s Notes: Graph matching networks are an emerging research topic and have
drawn significant interest from both the research community and industry due to
their broad range of application domains, such as computer vision (Chapter 20),
natural language processing (Chapter 21), program analysis (Chapter 22) and anomaly
detection (Chapter 26). Graph matching networks build on graph node representation
learning (Chapter 4) but focus more on the interaction of two graphs, from low-level
nodes to high-level graphs. They have a tight connection with link prediction
(Chapter 10) and self-supervised learning (Chapter 18), where graph matching could
be formulated as one of the sub-tasks for these graph learning tasks. Adversarial
robustness (Chapter 8), which has recently been extensively studied as well, could
have a direct impact on graph matching networks.


Graph Neural Networks: Graph Structure Learning


Yu Chen and Lingfei Wu 


Abstract Due to the excellent expressive power of Graph Neural Networks (GNNs)
in modeling graph-structured data, GNNs have achieved great success in various
applications such as Natural Language Processing, Computer Vision, recommender 
systems, drug discovery and so on. However, the great success of GNNs relies on 
the quality and availability of graph-structured data which can either be noisy or 
unavailable. The problem of graph structure learning aims to discover useful graph 
structures from data, which can help solve the above issue. This chapter attempts 
to provide a comprehensive introduction of graph structure learning through the 
lens of both traditional machine learning and GNNs. After reading this chapter, 
readers will learn how this problem has been tackled from different perspectives, 
for different purposes, via different techniques, as well as its great potential when 
combined with GNNs. Readers will also learn promising future directions in this 
research area. 


14.1 Introduction 


Recent years have seen a significantly increasing amount of interest in Graph Neu- 
ral Networks (GNNs) (Kipf and Welling, 2017b; Bronstein et al, 2017; Gilmer 
et al, 2017; Hamilton et al, 2017b; Li et al, 2016b) with a wide range of appli- 
cations in Natural Language Processing (Bastings et al, 2017; Chen et al, 2020p), 
Computer Vision (Norcliffe-Brown et al, 2018), recommender systems (Ying et al, 
2018b), drug discovery (You et al, 2018a) and so on. The powerful ability of GNNs to
learn expressive graph representations relies on the quality and availability of
graph-structured data. However, this poses some challenges for graph representation
learning with GNNs. On the one hand, in some scenarios where the graph structure
is already available, most of the GNN-based approaches assume that the given graph
topology is perfect, which does not necessarily hold true because i) the real-world
graph topology is often noisy or incomplete due to inevitably error-prone data
measurement or collection; and ii) the intrinsic graph topology might merely
represent physical connections (e.g., the chemical bonds in a molecule), and fail to
capture abstract or implicit relationships among vertices which can be beneficial
for certain downstream prediction tasks. On the other hand, in many real-world
applications such as those in Natural Language Processing or Computer Vision, the
graph representation of the data (e.g., a text graph for textual data or a scene
graph for images) might be unavailable. Early practice of GNNs (Bastings et al, 2017;
Xu et al, 2018d) heavily relied on manual graph construction, which requires
extensive human effort and domain expertise to obtain a reasonably performant graph
topology during the data preprocessing stage.

In order to tackle the above challenges, graph structure learning aims to dis- 
cover useful graph structures from data for better graph representation learning with 
GNNs. Recent attempts (Chen et al, 2020m,o; Liu et al, 2021; Franceschi et al, 
2019; Ma et al, 2019b; Elinas et al, 2020; Velickovic et al, 2020; Johnson et al, 
2020) focus on joint learning of graph structures and representations without re- 
sorting to human effort or domain expertise. Different sets of techniques have been 
developed for learning discrete graph structures and weighted graph structures for 
GNNs. More broadly speaking, graph structure learning has been widely studied in 
the literature of traditional machine learning in both unsupervised learning and su- 
pervised learning settings (Kalofolias, 2016; Kumar et al, 2019a; Berger et al, 2020; 
Bojchevski et al, 2017; Zheng et al, 2018b; Yu et al, 2019a; Li et al, 2020a). Besides, 
graph structure learning is also closely related to important problems such as graph 
generation (You et al, 2018a; Shi et al, 2019a), graph adversarial defenses (Zhang 
and Zitnik, 2020; Entezari et al, 2020; Jin et al, 2020a,e) and transformer mod- 
els (Vaswani et al, 2017). 

This chapter is organized as follows. We will first introduce how graph structure 
learning has been studied in the literature of traditional machine learning, prior to 
the recent surge of GNNs (section 14.2). We will introduce existing works on both 
unsupervised graph structure learning (section 14.2.1) and supervised graph
structure learning (section 14.2.2). Readers will later see how some of the
introduced techniques originally developed for traditional graph structure learning
have been revisited to improve graph structure learning for GNNs. Then we will move
to the main focus of this chapter, which is graph structure learning for GNNs, in
section 14.3. This part will cover various topics including joint graph structure
and representation learning for both unweighted and weighted graphs (section 14.3.1),
and the connections to other problems such as graph generation, graph adversarial
defenses and transformers (section 14.3.2). We will highlight some future directions
in section 14.4 including robust graph structure learning, scalable graph structure
learning, graph structure learning for heterogeneous graphs, and transferable graph
structure learning. We will summarize this chapter in section 14.5.


14.2 Traditional Graph Structure Learning


Graph structure learning has been widely studied from different perspectives in the 
literature of traditional machine learning, prior to the recent surge of Graph Neural 
Networks. Before we move to the recent achievements of graph structure learning 
in the field of Graph Neural Networks, which is the main focus of this chapter, 
in this section, we will first examine this challenging problem through the lens of 
traditional machine learning. 


14.2.1 Unsupervised Graph Structure Learning 


The task of unsupervised graph structure learning aims to directly learn a graph
structure from a set of data points in an unsupervised manner. The learned graph
structure may later be consumed by subsequent machine learning methods for various
prediction tasks. The most important benefit of this kind of approach is that it
does not require labeled data such as ground-truth graph structures for supervision,
which could be expensive to obtain. However, because the graph structure learning
process does not consider any particular downstream prediction task on the data,
the learned graph structure might be sub-optimal for the downstream task.


14.2.1.1 Graph Structure Learning from Smooth Signals 


Graph structure learning has been extensively studied in the literature of Graph Sig- 
nal Processing (GSP). It is often referred to as the graph learning problem in the lit- 
erature whose goal is to learn the topological structure from smooth signals defined 
on the graph in an unsupervised manner. These graph learning techniques (Jebara 
et al, 2009; Lake and Tenenbaum, 2010; Kalofolias, 2016; Kumar et al, 2019a; Kang 
et al, 2019; Kumar et al, 2020; Bai et al, 2020a) typically operate by solving an opti- 
mization problem with certain prior constraints on the properties (e.g., smoothness, 
sparsity) of graphs. Here, we introduce some representative prior constraints defined 
on graphs which have been widely used for solving the graph learning problem. 

Before introducing the specific graph learning techniques, we first provide the
formal definition of a graph and graph signals. Consider a graph
$\mathcal{G} = \{\mathcal{V}, \mathcal{E}\}$ with the vertex set $\mathcal{V}$ of cardinality $n$
and edge set $\mathcal{E}$. Its adjacency matrix $A \in \mathbb{R}^{n \times n}$ governs its
topological structure, where $A_{i,j} > 0$ indicates there is an edge connecting
vertex $i$ and $j$ and $A_{i,j}$ is the edge weight. Given an adjacency matrix $A$, we
can further obtain the graph Laplacian matrix $L = D - A$, where
$D_{i,i} = \sum_j A_{i,j}$ is the degree matrix whose off-diagonal entries are all zero.

A graph signal is defined as a function that assigns a scalar value to each vertex
of a graph. We can further define multi-channel signals $X \in \mathbb{R}^{n \times d}$ on a
graph that assign a $d$-dimensional vector to each vertex, and each column of the
feature matrix $X$ can be considered as a graph signal. Let $X_i \in \mathbb{R}^{d}$ denote
the graph signal defined on the $i$-th vertex.

Fitness. Early works (Wang and Zhang, 2007; Daitch et al, 2009) on graph learning
utilized the neighborhood information of each data point for graph construction by
assuming that each data point can be optimally reconstructed using a linear
combination of its neighbors. Wang and Zhang (2007) proposed to learn a graph with
normalized degrees by minimizing the following objective,

$$\sum_i \Big\| X_i - \sum_j A_{i,j} X_j \Big\|^2 \tag{14.1}$$

where $\sum_j A_{i,j} = 1$, $A_{i,j} \geq 0$.

Similarly, Daitch et al (2009) proposed to minimize a measure of fitness that
computes a weighted sum of the squared distance from each vertex to the weighted
average of its neighbors, formulated as follows:

$$\sum_i \Big\| D_{i,i} X_i - \sum_j A_{i,j} X_j \Big\|^2 = \| L X \|_F^2 \tag{14.2}$$

where $\|M\|_F = \big(\sum_{i,j} M_{i,j}^2\big)^{1/2}$ is the Frobenius norm.
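
The identity in Equation (14.2) follows because the $i$-th row of $LX$ equals
$D_{i,i} X_i - \sum_j A_{i,j} X_j$. The sketch below checks this numerically on a
three-vertex path graph with a single-channel signal; the helper names and the toy
data are illustrative.

```python
def laplacian(A):
    """L = D - A, where D is the diagonal matrix of row sums of A."""
    n = len(A)
    return [[(sum(A[i]) if i == j else 0.0) - A[i][j] for j in range(n)]
            for i in range(n)]


def fitness(A, X):
    """sum_i || D_ii X_i - sum_j A_ij X_j ||^2, computed vertex by vertex."""
    total = 0.0
    for i, row in enumerate(A):
        degree = sum(row)
        for k in range(len(X[0])):
            neighbor_sum = sum(a * X[j][k] for j, a in enumerate(row))
            total += (degree * X[i][k] - neighbor_sum) ** 2
    return total


def frobenius_sq_LX(A, X):
    """||L X||_F^2, computed from the Laplacian."""
    L = laplacian(A)
    n, d = len(X), len(X[0])
    return sum(sum(L[i][j] * X[j][k] for j in range(n)) ** 2
               for i in range(n) for k in range(d))


A_path = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]  # path graph 0 - 1 - 2
X_toy = [[0.0], [1.0], [3.0]]               # one scalar feature per vertex
print(fitness(A_path, X_toy), frobenius_sq_LX(A_path, X_toy))  # both equal 6.0
```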

Smoothness. Smoothness is another widely adopted assumption on natural graph
signals. Given a set of graph signals $X \in \mathbb{R}^{n \times d}$ defined on an undirected
weighted graph with an adjacency matrix $A \in \mathbb{R}^{n \times n}$, the smoothness of the
graph signals is usually measured by the Dirichlet energy (Belkin and Niyogi, 2002),

$$\Omega(A, X) = \frac{1}{2} \sum_{i,j} A_{i,j} \| X_i - X_j \|^2 = \mathrm{tr}(X^{\top} L X) \tag{14.3}$$

where $L$ is the Laplacian matrix and $\mathrm{tr}(\cdot)$ denotes the trace of a matrix.
Lake and Tenenbaum (2010); Kalofolias (2016) proposed to learn a graph by minimizing
$\Omega(A, X)$, which forces neighboring vertices to have similar features, thus
enforcing graph signals to change smoothly on the learned graph. Notably, solely
minimizing the above smoothness loss can lead to the trivial solution $A = 0$.
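
A short numerical sketch may help to see both the identity in Equation (14.3) and the
trivial solution. It evaluates the Dirichlet energy in its pairwise form and in its
trace form on a three-vertex path graph; the function names and signals are
illustrative choices.

```python
def dirichlet_energy(A, X):
    """0.5 * sum_{i,j} A_ij * ||X_i - X_j||^2 (pairwise form)."""
    n = len(A)
    return 0.5 * sum(A[i][j] * sum((a - b) ** 2 for a, b in zip(X[i], X[j]))
                     for i in range(n) for j in range(n))


def trace_XtLX(A, X):
    """tr(X^T L X) with L = D - A (trace form)."""
    n, d = len(A), len(X[0])
    L = [[(sum(A[i]) if i == j else 0.0) - A[i][j] for j in range(n)]
         for i in range(n)]
    return sum(X[i][k] * L[i][j] * X[j][k]
               for k in range(d) for i in range(n) for j in range(n))


A_path = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
smooth = [[0.0], [1.0], [2.0]]  # changes slowly along the path
rough = [[0.0], [5.0], [0.0]]   # changes abruptly along the path
A_empty = [[0, 0, 0], [0, 0, 0], [0, 0, 0]]

print(dirichlet_energy(A_path, smooth), trace_XtLX(A_path, smooth))  # 2.0 2.0
print(dirichlet_energy(A_path, rough))                               # 50.0
print(dirichlet_energy(A_empty, rough))                              # 0.0, the trivial solution
```

The empty graph attains zero energy for any signal, which is why additional terms or
constraints on the graph are needed.
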
Connectivity and Sparsity. In order to avoid the trivial solution caused by solely
minimizing the smoothness loss, Kalofolias (2016) imposed additional constraints
on the learned graph,

$$-\alpha \mathbf{1}^{\top} \log(A \mathbf{1}) + \beta \| A \|_F^2 \tag{14.4}$$

where the first term penalizes the formation of disconnected graphs via the
logarithmic barrier, and the second term controls sparsity by penalizing large
degrees due to the first term. Note that $\mathbf{1}$ denotes the all-ones vector. As a
result, this improves the overall connectivity of the graph without compromising
sparsity.
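
The effect of the two terms in Equation (14.4) can be illustrated with a direct
evaluation. The sketch below assumes a non-negative adjacency matrix and returns
infinity whenever some vertex has zero degree, which is how the logarithmic barrier
rules out the trivial solution $A = 0$; the function name and the hyper-parameter
values are illustrative.

```python
import math


def connectivity_sparsity_penalty(A, alpha, beta):
    """-alpha * 1^T log(A 1) + beta * ||A||_F^2 for a non-negative adjacency matrix."""
    degrees = [sum(row) for row in A]
    if any(d <= 0 for d in degrees):
        return math.inf  # log barrier: a vertex without edges is infinitely penalized
    barrier = -alpha * sum(math.log(d) for d in degrees)
    frobenius = beta * sum(a * a for row in A for a in row)
    return barrier + frobenius


A_path = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
A_empty = [[0, 0, 0], [0, 0, 0], [0, 0, 0]]
print(connectivity_sparsity_penalty(A_path, alpha=1.0, beta=0.5))
print(connectivity_sparsity_penalty(A_empty, alpha=1.0, beta=0.5))  # inf
```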

Similarly, Dong et al (2016) proposed to solve the following optimization problem:

$$\begin{aligned}
\min_{L \in \mathbb{R}^{n \times n},\, Y \in \mathbb{R}^{n \times p}} \;\; & \| X - Y \|_F^2 + \alpha\, \mathrm{tr}(Y^{\top} L Y) + \beta \| L \|_F^2 \\
\text{s.t.} \;\; & \mathrm{tr}(L) = n, \\
& L_{i,j} = L_{j,i} \leq 0, \quad i \neq j, \\
& L \cdot \mathbf{1} = \mathbf{0}
\end{aligned} \tag{14.5}$$

which is equivalent to jointly finding the graph Laplacian $L$ and $Y$ (i.e., a
“noiseless” version of the zero-mean observation $X$), such that $Y$ is close to $X$
and, at the same time, $Y$ is smooth on the sparse graph. Note that the first
constraint acts as a normalization factor and helps avoid trivial solutions, and the
second and third constraints guarantee that the learned $L$ is a valid Laplacian
matrix that is positive semidefinite.
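
The three constraints in Equation (14.5) are easy to test on a candidate matrix. The
following sketch checks them up to a numerical tolerance; the function name
`satisfies_laplacian_constraints` and the example matrices are hypothetical and only
serve to show what the feasible set looks like.

```python
def satisfies_laplacian_constraints(L, tol=1e-9):
    """Check tr(L) = n, L_ij = L_ji <= 0 for i != j, and L 1 = 0."""
    n = len(L)
    trace_ok = abs(sum(L[i][i] for i in range(n)) - n) <= tol
    symmetric = all(abs(L[i][j] - L[j][i]) <= tol
                    for i in range(n) for j in range(i))
    off_diagonal_ok = all(L[i][j] <= tol
                          for i in range(n) for j in range(n) if i != j)
    rows_sum_to_zero = all(abs(sum(row)) <= tol for row in L)
    return trace_ok and symmetric and off_diagonal_ok and rows_sum_to_zero


# Laplacian of a triangle with edge weights 0.5: trace is 3 = n.
L_triangle = [[1.0, -0.5, -0.5], [-0.5, 1.0, -0.5], [-0.5, -0.5, 1.0]]
# Laplacian of an unweighted path: trace is 4, so it needs rescaling by 3/4.
L_path = [[1.0, -1.0, 0.0], [-1.0, 2.0, -1.0], [0.0, -1.0, 1.0]]
L_path_scaled = [[0.75 * v for v in row] for row in L_path]

print(satisfies_laplacian_constraints(L_triangle))     # True
print(satisfies_laplacian_constraints(L_path))         # False
print(satisfies_laplacian_constraints(L_path_scaled))  # True
```

The trace constraint fixes the overall scale of the edge weights, which is what
excludes the all-zero Laplacian.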

Ying et al (2020a) aimed to learn a sparse graph under a Laplacian constrained
Gaussian graphical model, and proposed a nonconvex penalized maximum likelihood
method that solves a sequence of weighted $\ell_1$-norm regularized sub-problems.
Maretic et al (2017) proposed to learn a sparse graph signal model by alternating
between a signal sparse coding step and a graph update step.

In order to reduce the computational complexity of solving the optimization 
problem, many approximation techniques (Daitch et al, 2009; Kalofolias and Per- 
raudin, 2019; Berger et al, 2020) have been explored. Dong et al (2019) provided a 
good literature review on learning graphs from data from a GSP perspective. 


14.2.1.2 Spectral Clustering via Graph Structure Learning 


Graph structure learning has also been studied in the field of clustering analysis. 
For example, in order to improve the robustness of spectral clustering methods for 
noisy input data, Bojchevski et al (2017) assumed that the observed graph $A$ can be
decomposed into the corrupted graph $A^c$ and the good (i.e., clean) graph $A^g$, and
that it is beneficial to only perform the spectral clustering on the clean graph. They hence
proposed to jointly perform the spectral clustering and the decomposition of the ob- 
served graph, and adopted a highly efficient block coordinate-descent (alternating) 
optimization scheme to approximate the objective function. Huang et al (2019b) 
proposed a multi-view learning model which simultaneously conducts multi-view 
clustering and learns similarity relationships between data points in kernel spaces. 


14.2.2 Supervised Graph Structure Learning 


The task of supervised graph structure learning aims to learn a graph structure from
data in a supervised manner. Such methods may or may not consider a particular
downstream prediction task during the model training phase.


14.2.2.1 Relational Inference for Interacting Systems


Relational inference for interacting systems aims to study how objects in com- 
plex systems interact. Early works considered a fixed or fully-connected interaction 
graph (Battaglia et al, 2016; van Steenkiste et al, 2018) while modeling the interac- 
tion dynamics among objects. Sukhbaatar et al (2016) proposed a neural model to 
learn continuous communication among a dynamically changing set of agents where 
the communication graph changes over time as agents move, enter and exit the envi- 
ronment. Recent efforts (Kipf et al, 2018; Li et al, 2020a) have been made to simul- 
taneously infer the latent interaction graph and model the interaction dynamics. Kipf 
et al (2018) proposed a variational autoencoder (VAE) (Kingma and Welling, 2014) 
based approach which learns to infer the interaction graph structure and model the 
interaction dynamics among physical objects simultaneously from their observed 
trajectories in an unsupervised manner. The discrete latent code of VAE represents 
edge connections of the latent interaction graph, and both the encoder and decoder 
take the form of a GNN to model the interaction dynamics among objects. Because 
the latent distribution of VAE is discrete, the authors adopted a continuous relax- 
ation in order to use the reparameterization trick (Kingma et al, 2014). While Kipf 
et al (2018) focused on inferring a static interaction graph, Li et al (2020a) designed 
a dynamic mechanism to evolve the latent interaction graph adaptively over time. A 
Gated Recurrent Unit (GRU) (Cho et al, 2014a) was applied to capture the history 
information and adjust the prior interaction graph. 


14.2.2.2 Structure Learning in Bayesian Networks 


A Bayesian network (BN) is a Probabilistic Graphical Model (PGM) which en- 
codes conditional dependencies between random variables via a directed acyclic 
graph (DAG), where each random variable is represented as a node in the DAG. The
problem of learning the BN structure is important yet challenging in Bayesian net- 
works research. Most existing works on BN learning focus on score-based learn- 
ing of DAGs, and aim to find a DAG with the maximal score where a score indi- 
cates how well any candidate DAG is supported by the observed data (and any prior 
knowledge). Early works treat BN learning as a combinatorial optimization problem 
which is NP-hard due to the intractable search space of DAGs scaling superexpo- 
nentially with the number of nodes. Some efficient methods have been proposed 
for exact BN learning via dynamic programming (Koivisto and Sood, 2004; Silan- 
der and Myllymäki, 2006) or integer programming (Jaakkola et al, 2010; Cussens, 
2011). Recently, Zheng et al (2018b) proposed to formulate the traditional combi- 
natorial optimization problem into a purely continuous optimization problem over 
real matrices with a smooth equality constraint ensuring acyclicity of the graph. The 
resulting problem can hence be efficiently solved by standard numerical algorithms. 
A follow-up work (Yu et al, 2019a) leveraged the expressive power of GNNs, and 
proposed a variational autoencoder (VAE) based deep generative model with a variant
of the structural constraint to learn the DAG. The VAE was parameterized by a
GNN that can naturally handle both discrete and vector-valued random variables. 
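
To make the idea of a smooth acyclicity constraint concrete, the following minimal sketch evaluates the penalty $h(W) = \operatorname{tr}(e^{W \circ W}) - d$ used by Zheng et al (2018b), which is zero exactly when the weighted adjacency matrix $W$ describes a DAG. The function name, the truncated power series used for the matrix exponential, and the toy matrices are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def acyclicity_penalty(W, terms=20):
    """Return tr(exp(W * W)) - d using a truncated power series for exp."""
    d = W.shape[0]
    M = W * W                 # element-wise square, so all entries are non-negative
    term = np.eye(d)          # holds M^k / k! for the current k
    total = np.eye(d)
    for k in range(1, terms + 1):
        term = term @ M / k
        total = total + term
    return float(np.trace(total) - d)

# Weighted adjacency of the chain 0 -> 1 -> 2 (acyclic).
dag = np.array([[0.0, 1.5, 0.0],
                [0.0, 0.0, -2.0],
                [0.0, 0.0, 0.0]])
cyclic = dag.copy()
cyclic[2, 0] = 0.7            # adds the edge 2 -> 0, closing a cycle

h_dag = acyclicity_penalty(dag)        # zero for a DAG
h_cyclic = acyclicity_penalty(cyclic)  # strictly positive once a cycle exists
```

Because the penalty is differentiable in $W$, it can be attached to a data-fitting loss as an equality constraint and handled by standard numerical solvers.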


14.3 Graph Structure Learning for Graph Neural Networks 


Graph structure learning has recently been revisited in the field of GNNs so as to 
handle the scenarios where the graph-structured data is noisy or unavailable. Recent 
attempts in this line of research mainly focus on joint learning of graph structures 
and representations without resorting to human effort or domain expertise. Fig. 14.1
shows the overview of graph structure learning for GNNs. Besides, several closely
related problems (including graph generation, graph adversarial defenses and
transformer models) have been actively studied in recent years. We will discuss their
connections to, and differences from, graph structure learning for GNNs in this section.

Fig. 14.1: The overview of graph structure learning for GNNs.


14.3.1 Joint Graph Structure and Representation Learning


In recent practice of GNNs, joint graph structure and representation learning has
drawn growing attention. This line of research aims to jointly optimize the graph
structure and GNN parameters toward the downstream prediction task in an end- 
to-end manner, and can be roughly categorized into two groups: learning discrete 
graph structures and learning weighted adjacency matrices. The first kind of ap- 
proaches (Chen et al, 2018e; Ma et al, 2019b; Zhang et al, 2019d; Elinas et al, 
2020; Pal et al, 2020; Stanic et al, 2021; Franceschi et al, 2019; Kazi et al, 2020) 
operate by sampling a discrete graph structure (i.e., corresponding to a binary ad- 
jacency matrix) from the learned probabilistic adjacency matrix, and then feeding 
the graph to a subsequent GNN in order to obtain the task prediction. Because the
sampling operation breaks the differentiability of the whole learning system,
techniques such as variational inference (Hoffman et al, 2013) or Reinforcement
Learning (Williams, 1992) are applied to optimize the learning system. Because the
non-differentiable sampling operation makes discrete graph structure learning hard to
optimize, and makes it difficult to learn weights on edges, the other kind of
approaches (Chen et al, 2020m; Li et al, 2018c; Chen et al, 2020o; Huang et al, 2020a;
Liu et al, 2019b, 2021; Norcliffe-Brown et al, 2018) focuses on learning the weighted
(and usually sparse) adjacency matrix associated with a weighted graph, which is later
consumed by a subsequent GNN for the prediction task. Before discussing these two
types of approaches in detail, we first formulate the joint graph structure and
representation learning problem.


14.3.1.1 Problem Formulation 


Let the graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ be represented as a set of $n$
nodes $v_i \in \mathcal{V}$ with an initial node feature matrix
$X \in \mathbb{R}^{d \times n}$, and a set of $m$ edges $(v_i, v_j) \in \mathcal{E}$
(binary or weighted) forming an initial noisy adjacency matrix
$A^{(0)} \in \mathbb{R}^{n \times n}$. Given a noisy graph input
$\mathcal{G} := \{A^{(0)}, X\}$ or only a node feature matrix
$X \in \mathbb{R}^{d \times n}$, the joint graph structure and representation learning
problem we consider aims to produce an optimized graph
$\mathcal{G}^* := \{A^{(*)}, X\}$ and its corresponding node embeddings
$Z = f(\mathcal{G}^*, \theta) \in \mathbb{R}^{h \times n}$, with respect to a certain
downstream prediction task. Here, we denote $f$ as a GNN and $\theta$ as its model
parameters.


14.3.1.2 Learning Discrete Graph Structures 


In order to deal with the issue of uncertainty on graphs, many of the existing works 
on learning discrete graph structures regard the graph structure as a random variable 
where a discrete graph structure can be sampled from a probabilistic adjacency
matrix. They usually leverage various techniques such as variational inference
(Chen et al, 2018e; Ma et al, 2019b; Zhang et al, 2019d; Elinas et al, 2020; Pal
et al, 2020; Stanic et al, 2021), bilevel optimization (Franceschi et al, 2019), and Re- 
inforcement Learning (Kazi et al, 2020) to jointly optimize the graph structure and 
GNN parameters. Notably, they are often limited to the transductive learning setting 
where the node features and graph structure are fully observed during both the train- 
ing and inference stages. In this section, we introduce some representative works on 
this topic and show how they approach the problem from different perspectives. 

Franceschi et al (2019) proposed to jointly learn a discrete probability distribution
on the edges of the graph and the parameters of GNNs by treating the task as a
bilevel optimization problem (Colson et al, 2007), formulated as,

$$\min_{\theta \in \overline{\mathcal{H}}_N} \; \mathbb{E}_{A \sim \mathrm{Ber}(\theta)} \left[ F(w_\theta, A) \right] \quad \text{such that} \quad w_\theta = \arg\min_{w} \; \mathbb{E}_{A \sim \mathrm{Ber}(\theta)} \left[ L(w, A) \right] \tag{14.6}$$

where $\overline{\mathcal{H}}_N$ denotes the convex hull of the set of all adjacency
matrices for $N$ nodes, and $L(w, A)$ and $F(w_\theta, A)$ are both task-specific loss
functions measuring the difference between GNN predictions and ground-truth labels,
computed on a training set and a validation set, respectively. Each edge (i.e., node
pair) of the graph is independently modeled as a Bernoulli random variable, and an
adjacency matrix $A \sim \mathrm{Ber}(\theta)$ can thus be sampled from the graph
structure distribution parameterized by $\theta$. The outer objective (i.e., the first
objective) aims to find an optimal discrete graph structure given a GCN, and the inner
objective (i.e., the second objective) aims to find the optimal parameters $w_\theta$
of a GCN given a graph. The authors approximately solved the above challenging bilevel
problem with hypergradient descent.
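
The expectations over $A \sim \mathrm{Ber}(\theta)$ in the bilevel problem are typically approximated by sampling. The sketch below only illustrates that sampling step: it draws symmetric binary adjacency matrices with independent Bernoulli edges and forms a Monte Carlo estimate of an expected quantity. The helper names, the symmetric and self-loop-free sampling, and the toy objective (the edge count) are assumptions made for illustration; the hypergradient procedure itself is not shown.

```python
import numpy as np

def sample_adjacency(theta, rng):
    """Draw a symmetric binary adjacency matrix with independent Bernoulli edges."""
    n = theta.shape[0]
    # Sample each unordered node pair once (strict upper triangle), then mirror it.
    upper = np.triu(rng.random((n, n)) < theta, k=1)
    return (upper | upper.T).astype(float)

def monte_carlo_expectation(fn, theta, rng, num_samples=500):
    """Estimate E_{A ~ Ber(theta)}[fn(A)] by averaging over sampled graphs."""
    return float(np.mean([fn(sample_adjacency(theta, rng)) for _ in range(num_samples)]))

rng = np.random.default_rng(0)
theta = np.full((4, 4), 0.5)                      # every edge present with probability 0.5
count_edges = lambda A: A.sum() / 2.0             # toy stand-in for a loss F(w, A)
expected_edges = monte_carlo_expectation(count_edges, theta, rng)  # close to 6 * 0.5 = 3
```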

Considering that real-world graphs are often noisy, Ma et al (2019b) viewed the
node features, graph structure and node labels as random variables, and modeled their
joint distribution with a flexible generative model for the graph-based
semi-supervised learning problem. Inspired by random graph models from the network
science field (Newman, 2010), they assumed that the graph is generated based on
node features and labels, and thus factored the joint distribution as follows:

$$p(X, Y, G) = p_\theta(G \mid X, Y) \, p_\theta(Y \mid X) \, p(X) \tag{14.7}$$

where $X$, $Y$ and $G$ are random variables corresponding to the node features, labels
and graph structure, and $\theta$ are learnable model parameters. Note that the
conditional probabilities $p_\theta(G \mid X, Y)$ and $p_\theta(Y \mid X)$ can be any
flexible parametric families of distributions as long as they are differentiable
almost everywhere w.r.t. $\theta$. In the paper, $p_\theta(G \mid X, Y)$ is instantiated
with either a latent space model (LSM) (Hoff et al, 2002) or a stochastic block model
(SBM) (Holland et al, 1983). During the inference stage, in order to infer the missing
node labels denoted as $Y_{miss}$, the authors leveraged the recent advances in scalable
variational inference (Kingma and Welling, 2014; Kingma et al, 2014) to approximate
the posterior distribution $p_\theta(Y_{miss} \mid X, Y_{obs}, G)$ via a recognition
model $q_\phi(Y_{miss} \mid X, Y_{obs}, G)$ parameterized by $\phi$, where $Y_{obs}$
denotes the observed node labels. In the paper, $q_\phi(Y_{miss} \mid X, Y_{obs}, G)$ is
instantiated with a GNN. The model parameters $\theta$ and $\phi$ are jointly optimized
by maximizing the Evidence Lower Bound (Bishop, 2006) of the observed data
$(Y_{obs}, G)$ conditioned on $X$.

Elinas et al (2020) aimed to maximize the posterior over the binary adjacency
matrix given the observed data (i.e., node features $X$ and observed node labels
$Y^o$), formulated as,

$$p(A \mid X, Y^o) \propto p_\theta(Y^o \mid X, A) \, p(A) \tag{14.8}$$

where $p_\theta(Y^o \mid X, A)$ is a conditional likelihood which can be further
factorized following the conditional independence assumption,

$$p_\theta(Y^o \mid X, A) = \prod_{y_i \in Y^o} p_\theta(y_i \mid X, A), \qquad p_\theta(y_i \mid X, A) = \mathrm{Cat}(y_i \mid \pi_i) \tag{14.9}$$

where $\mathrm{Cat}(y_i \mid \pi_i)$ denotes a categorical distribution, and $\pi_i$ is
the $i$-th row of a probability matrix $\Pi \in \mathbb{R}^{N \times C}$ modeled by a
GCN, namely, $\Pi = \mathrm{GCN}(X, A, \theta)$. As for the prior distribution over the
graph $p(A)$, the authors considered the following form,

$$p(A) = \prod_{i,j} p(A_{i,j}), \qquad p(A_{i,j}) = \mathrm{Bern}(A_{i,j} \mid \rho^o_{i,j}) \tag{14.10}$$

where $\mathrm{Bern}(A_{i,j} \mid \rho^o_{i,j})$ is a Bernoulli distribution over the
adjacency matrix entry $A_{i,j}$ with parameter $\rho^o_{i,j}$. In the paper,
$\rho^o_{i,j} = \rho_1 \bar{A}_{i,j} + \rho_2 (1 - \bar{A}_{i,j})$ was constructed to
encode the degree of belief on the absence and presence of observed links with
hyperparameters $0 < \rho_1, \rho_2 < 1$. Note that $\bar{A}_{i,j}$ is the observed
graph structure, which can potentially be perturbed. If there is no input graph
available, a KNN graph can be employed. Given the above formulations, the authors
developed a stochastic variational inference algorithm by leveraging the
reparameterization trick (Kingma et al, 2014) and Concrete distribution techniques
(Maddison et al, 2017; Jang et al, 2017) to optimize the graph posterior
$p(A \mid X, Y^o)$ and the GCN parameters $\theta$ jointly.
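
As a small illustration of the edge prior in eq. (14.10), the sketch below builds the Bernoulli parameters from an observed adjacency matrix and evaluates the log-prior of a candidate graph. The function names and the choice to sum over all ordered pairs (including the diagonal) are assumptions made for brevity, not details taken from the paper.

```python
import numpy as np

def edge_prior(A_obs, rho1, rho2):
    """Bernoulli parameter per node pair: rho1 on observed links, rho2 elsewhere."""
    return rho1 * A_obs + rho2 * (1.0 - A_obs)

def log_prior(A, A_obs, rho1, rho2):
    """log p(A) as a sum of independent Bernoulli log-probabilities."""
    p = edge_prior(A_obs, rho1, rho2)
    return float(np.sum(A * np.log(p) + (1.0 - A) * np.log(1.0 - p)))

A_obs = np.array([[0.0, 1.0],
                  [1.0, 0.0]])
# A candidate graph that agrees with the observed one is more probable
# than one that drops the observed link.
lp_same = log_prior(A_obs, A_obs, rho1=0.9, rho2=0.1)
lp_empty = log_prior(np.zeros((2, 2)), A_obs, rho1=0.9, rho2=0.1)
```

Setting $\rho_1$ close to 1 and $\rho_2$ close to 0 expresses strong trust in the observed graph, while values nearer 0.5 let the data override it.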

Kazi et al (2020) designed a probabilistic graph generator whose underlying
probability distribution is computed based on pair-wise node similarity, formulated
as,

$$p_{ij} = e^{-t \lVert X_i - X_j \rVert} \tag{14.11}$$

where $t$ is a temperature parameter, and $X_i$ is the node embedding of node $v_i$.
Given the above edge probability distribution, they adopted the Gumbel-Top-k trick
(Kool et al, 2019) to sample an unweighted KNN graph which would be fed into a
GNN-based prediction network. Because the sampling operation breaks the
differentiability of the model, the authors exploited Reinforcement Learning to reward
edges involved in a correct classification and penalize edges which led to
misclassification.
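
The sampling step can be sketched as follows: Gumbel noise is added to the log edge probabilities and the $k$ largest perturbed scores per node are kept, which samples $k$ neighbors without replacement. This is a minimal sketch under stated assumptions: node embeddings are stored as rows of `X`, self-loops are excluded, and the function name is hypothetical. The Reinforcement Learning update is not shown.

```python
import numpy as np

def gumbel_top_k_graph(X, t, k, rng):
    """Sample an unweighted kNN graph from p_ij = exp(-t * ||x_i - x_j||)."""
    n = X.shape[0]
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    log_p = -t * dist                          # log of the unnormalized edge probability
    np.fill_diagonal(log_p, -np.inf)           # assumption: a node never picks itself
    perturbed = log_p + rng.gumbel(size=(n, n))
    idx = np.argsort(-perturbed, axis=1)[:, :k]   # k largest perturbed scores per row
    A = np.zeros((n, n))
    A[np.arange(n)[:, None], idx] = 1.0
    return A

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))                    # 6 nodes with 3-dimensional embeddings
A = gumbel_top_k_graph(X, t=1.0, k=2, rng=rng)
```

A larger temperature $t$ concentrates the samples on the nearest neighbors, while $t$ close to zero makes the choice nearly uniform.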


14.3.1.3 Learning Weighted Graph Structures


Unlike the kind of graph structure learning approaches focusing on learning a dis- 
crete graph structure (i.e., binary adjacency matrix) for the GNN, there is a class of 
approaches instead focusing on learning a weighted graph structure (i.e., weighted 
adjacency matrix). In comparison with learning a discrete graph structure, learning 
a weighted graph structure has several advantages. Firstly, optimizing a weighted 
adjacency matrix is much more tractable than optimizing a binary adjacency matrix 
because the former can be easily achieved by SGD techniques (Bottou, 1998) or 
even convex optimization techniques (Boyd et al, 2004) while the latter often has to
resort to more challenging techniques such as variational inference (Hoffman et al, 
2013), Reinforcement Learning (Williams, 1992) and combinatorial optimization 
techniques (Korte et al, 2011) due to its non-differentiability. Secondly, a weighted 
adjacency matrix is able to encode richer information on edges compared to a binary 
adjacency matrix, which could benefit the subsequent graph representation learning. 
For example, the widely used Graph Attention Network (GAT) (Veličković et al, 
2018) essentially aims to learn edge weights for the input binary adjacency matrix 
which benefit the subsequent message passing operations. In this subsection, we 
will first introduce some common graph similarity metric learning techniques as 
well as graph sparsification techniques widely used in existing works for learning 
a sparse weighted graph by considering pair-wise node similarity in the embedding
space. Some representative graph regularization techniques will be later introduced 
for controlling the quality of the learned graph structure. We will then discuss the 
importance of combining both of the intrinsic graph structures and learned implicit 
graph structures for better learning performance. Finally, we will cover some im- 
portant learning paradigms for the joint learning of graph structures and graph rep- 
resentations that have been successfully adopted by existing works. 


Graph Similarity Metric Learning Techniques 


As introduced in section 14.2.1.1, prior works on unsupervised graph structure 
learning from smooth signals also aim to learn a weighted adjacency matrix from 
data. Nevertheless, they are incapable of handling the inductive learning setting
where there are unseen graphs or nodes in the inference phase. This is because they
often learn by directly optimizing an adjacency matrix based on certain prior
constraints on the graph properties. Many works on discrete graph structure learning
(section 14.3.1.2) have trouble conducting inductive learning as well, for a similar
reason.

Inspired by the success of attention-based techniques (Vaswani et al, 2017; 
Veličković et al, 2018) for modeling relationships among objects, many recent 
works in the literature cast graph structure learning as similarity metric learning 
defined upon the node embedding space assuming that the node attributes more or 
less contain useful information for inferring the implicit topological structure of the 
graph. The biggest advantage of this strategy is that the learned similarity metric
function can be later applied to an unseen set of node embeddings to infer a graph
structure, thus enabling inductive graph structure learning.

For data deployed in non-Euclidean domains such as graph data, the Euclidean 
distance is not necessarily the optimal metric for measuring node similarity. Com- 
mon options for metric learning include cosine similarity (Nguyen and Bai, 2010), 
radial basis function (RBF) kernel (Yeung and Chang, 2007) and attention mech- 
anisms (Bahdanau et al, 2015; Vaswani et al, 2017). In general, according to the 
types of raw information sources needed, we group the similarity metric learning 
functions into two categories: Node Embedding Based Similarity Metric Learning 
and Structure-aware Similarity Metric Learning. Next, we will introduce some rep- 
resentative metric learning functions from both categories which have been success- 
fully adopted in prior works on graph structure learning for GNNs. 


Node Embedding Based Similarity Metric Learning 


Node embedding based similarity metric learning functions are designed to learn a 
pair-wise node similarity matrix based on node embeddings which ideally encode 
important semantic meanings of the nodes for graph structure learning. 


Attention-based Similarity Metric Functions Most similarity metric functions
proposed so far are based on the attention mechanism (Bahdanau et al, 2015;
Vaswani et al, 2017). Norcliffe-Brown et al (2018) adopted a simple metric function
which computes the dot product between any pair of node embeddings (eq. (14.12)).
Given its limited learning capacity, it might have difficulty learning an optimal graph
structure.

$$S_{i,j} = \vec{v}_i^\top \vec{v}_j \tag{14.12}$$

where $S \in \mathbb{R}^{n \times n}$ is a node similarity matrix, and $\vec{v}_i$ is
the vector representation of node $v_i$.

To enrich the learning capacity of dot product, Chen et al (2020n) proposed a
modified dot product by introducing learnable parameters, formulated as follows:

$$S_{i,j} = (\vec{v}_i \odot \vec{u})^\top \vec{v}_j \tag{14.13}$$

where $\odot$ denotes element-wise multiplication, and $\vec{u}$ is a non-negative
trainable weight vector which learns to highlight different dimensions of the node
embeddings. Note that the output similarity matrix S is asymmetric.

Chen et al (2020o) proposed a more expressive version of dot product by introducing
a weight matrix, formulated as follows:

$$S_{i,j} = \mathrm{ReLU}(W \vec{v}_i)^\top \, \mathrm{ReLU}(W \vec{v}_j) \tag{14.14}$$

where $W$ is a $d \times d$ weight matrix, and $\mathrm{ReLU}(x) = \max(0, x)$ is a
rectified linear unit (ReLU) (Nair and Hinton, 2010) which is used here to enforce the
sparsity of the output similarity matrix.
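
The three dot-product style metrics above differ only in how the embeddings are transformed before the inner product. The sketch below puts them side by side; it assumes node embeddings are stored as rows of `V` (one node per row), and the function names are illustrative.

```python
import numpy as np

def dot_similarity(V):
    """Plain dot product between every pair of node embeddings, as in eq. (14.12)."""
    return V @ V.T

def weighted_dot_similarity(V, u):
    """Dot product with a per-dimension weight vector u, as in eq. (14.13)."""
    return (V * u) @ V.T

def relu_projected_similarity(V, W):
    """Dot product of ReLU-activated linear projections, as in eq. (14.14)."""
    H = np.maximum(V @ W.T, 0.0)     # row i holds ReLU(W v_i)
    return H @ H.T

V = np.array([[1.0, 2.0],
              [3.0, 4.0]])
S_dot = dot_similarity(V)
S_weighted = weighted_dot_similarity(V, u=np.array([1.0, 0.0]))  # keeps dimension 0 only
S_relu = relu_projected_similarity(V, W=np.eye(2))
```

In practice `u` and `W` would be trainable parameters updated by backpropagation; here they are fixed arrays so the arithmetic is easy to follow.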

Similar to (Chen et al, 2020o), On et al (2020) introduced a learnable mapping
function to node embeddings before computing the dot product, and applied a ReLU
function to enforce sparsity, formulated as follows:

$$S_{i,j} = \mathrm{ReLU}\big(f(\vec{v}_i)^\top f(\vec{v}_j)\big) \tag{14.15}$$

where $f : \mathbb{R}^d \to \mathbb{R}^d$ is a single-layer feed-forward network
without non-linear activation.

Besides using ReLU to enforce sparsity, Yang et al (2018c) applied the square
operation to stabilize training, and the row-normalization operation to obtain a
normalized similarity matrix, formulated as follows:

$$S_{i,j} = \frac{\big(\mathrm{ReLU}((W_1 \vec{v}_i)^\top W_2 \vec{v}_j + b)\big)^2}{\sum_k \big(\mathrm{ReLU}((W_1 \vec{v}_k)^\top W_2 \vec{v}_j + b)\big)^2} \tag{14.16}$$

where $W_1$ and $W_2$ are $d \times d$ weight matrices, and $b$ is a scalar parameter.

Unlike Chen et al (2020o), who applied the same linear transformation to both node
embeddings, Huang et al (2020a) applied different linear transformations to the two
node embeddings when computing the pair-wise node similarity, formulated as follows:

$$S_{i,j} = \mathrm{softmax}\big((W_1 \vec{v}_i)^\top W_2 \vec{v}_j\big) \tag{14.17}$$

where $W_1$ and $W_2$ are $d \times d$ weight matrices, and
$\mathrm{softmax}(\vec{z})_i = \frac{e^{z_i}}{\sum_j e^{z_j}}$ is applied to
obtain a row-normalized similarity matrix.
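
A minimal sketch of the bilinear attention metric with a row-wise softmax is given below. It assumes node embeddings are stored as rows of `V`, and subtracts the row maximum before exponentiating for numerical stability; both the function name and that stabilization step are choices made here rather than details from the cited work.

```python
import numpy as np

def softmax_attention_similarity(V, W1, W2):
    """Row-normalized similarity with S[i, j] = softmax_j((W1 v_i)^T (W2 v_j))."""
    scores = (V @ W1.T) @ (V @ W2.T).T                    # entry (i, j) is (W1 v_i)^T (W2 v_j)
    scores = scores - scores.max(axis=1, keepdims=True)   # stabilize the exponentials
    weights = np.exp(scores)
    return weights / weights.sum(axis=1, keepdims=True)

V = np.eye(2)                                   # two orthogonal unit embeddings
S = softmax_attention_similarity(V, W1=np.eye(2), W2=np.eye(2))
```

Using two different matrices `W1` and `W2` generally makes the scores, and hence the learned graph, directed: node i attending to node j need not equal node j attending to node i.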

Veličković et al (2020) aimed at graph structure learning in a temporal setting
where the implicit graph structure to be learned changes over time. At each time
step $t$, they first computed the pair-wise node similarity $a_{ij}^{(t)}$ using the
same attention mechanism as in (Huang et al, 2020a), and based on that, they further
obtained an “aggregated” adjacency matrix $\tilde{S}^{(t)}$ by deriving a new edge for
node $i$ by choosing node $j$ with the maximal $a_{ij}^{(t)}$. The whole process is
formulated as follows:

$$\begin{aligned}
a_{ij}^{(t)} &= \mathrm{softmax}\big((W_1 \vec{v}_i^{(t)})^\top W_2 \vec{v}_j^{(t)}\big) \\
\tilde{S}_{ij}^{(t)} &= \mu_i^{(t)} \tilde{S}_{ij}^{(t-1)} + \big(1 - \mu_i^{(t)}\big) \, \mathbb{1}_{j = \arg\max_k (a_{ik}^{(t)})} \\
S_{ij}^{(t)} &= \tilde{S}_{ij}^{(t)} \lor \tilde{S}_{ji}^{(t)}
\end{aligned} \tag{14.18}$$

where $\mu_i^{(t)}$ is a learnable binary gating mask, $\lor$ denotes logical
disjunction between the two operands to enforce symmetry, and $W_1$ and $W_2$ are
$d \times d$ weight matrices. Because the argmax operation makes the whole learning
system non-differentiable, the authors provided the ground-truth graph structures for
supervision at each time step.


Cosine-based Similarity Metric Functions Chen et al (2020m) proposed a multi-head
weighted cosine similarity function which aims at capturing pair-wise node
similarity from multiple perspectives, formulated as follows:

$$S_{i,j}^{p} = \cos(\vec{w}_p \odot \vec{v}_i, \; \vec{w}_p \odot \vec{v}_j), \qquad S_{i,j} = \frac{1}{m} \sum_{p=1}^{m} S_{i,j}^{p} \tag{14.19}$$

where $\vec{w}_p$ is a learnable weight vector associated with the $p$-th perspective,
and has the same dimension as the node embeddings, and the final similarity $S_{i,j}$
averages the $m$ perspectives. Intuitively, $S_{i,j}^{p}$ computes the pair-wise
cosine similarity for the $p$-th perspective where each perspective considers one part
of the semantics captured in the embeddings. Moreover, as observed in (Vaswani
et al, 2017; Veličković et al, 2018), employing multi-head learners is able to stabilize
the learning process and increase the learning capacity.
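
A multi-head weighted cosine similarity can be sketched as below. The sketch assumes node embeddings are rows of `V`, one weight vector per perspective, and that the per-perspective similarities are combined by a simple average; the averaging step and the small `eps` guard against zero-norm vectors are assumptions of this illustration.

```python
import numpy as np

def multi_head_cosine_similarity(V, head_weights, eps=1e-12):
    """Average, over perspectives p, of cos(w_p * v_i, w_p * v_j) for every node pair."""
    per_head = []
    for w in head_weights:                      # one weight vector per perspective
        H = V * w                               # re-weight embedding dimensions
        norms = np.linalg.norm(H, axis=1, keepdims=True)
        H = H / np.maximum(norms, eps)          # unit-normalize rows (zero rows stay zero)
        per_head.append(H @ H.T)
    return np.mean(per_head, axis=0)

V = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
heads = np.array([[1.0, 1.0],                   # perspective 1 uses both dimensions
                  [1.0, 0.0]])                  # perspective 2 looks at dimension 0 only
S = multi_head_cosine_similarity(V, heads)
```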


Kernel-based Similarity Metric Functions Besides attention-based and cosine-based
similarity metric functions, researchers have also explored kernel-based
metric functions for graph structure learning. Li et al (2018c) applied a Gaussian
kernel to the distance between any pair of node embeddings, formulated as follows:

$$d(\vec{v}_i, \vec{v}_j) = \sqrt{(\vec{v}_i - \vec{v}_j)^\top M (\vec{v}_i - \vec{v}_j)}, \qquad S(\vec{v}_i, \vec{v}_j) = \exp\left(\frac{-d(\vec{v}_i, \vec{v}_j)}{2\sigma^2}\right) \tag{14.20}$$

where $\sigma$ is a scalar hyperparameter which determines the width of the Gaussian
kernel, and $d(\vec{v}_i, \vec{v}_j)$ computes the Mahalanobis distance between the two
node embeddings $\vec{v}_i$ and $\vec{v}_j$. Notably, in the classical Mahalanobis
distance, $M$ is the inverse of the covariance matrix of the node embedding
distribution, assuming all the node embeddings of the graph are drawn from the same
distribution. If we set $M = I$, the Mahalanobis distance reduces to the
Euclidean distance. To make $M$ a symmetric and positive semi-definite matrix, the
authors let $M = W W^\top$ where $W$ is a $d \times d$ learnable weight matrix. We can
also regard $W$ as the transform basis to the space where we measure the Euclidean
distance between two vectors.
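
The kernel above can be sketched in a few lines. The sketch assumes node embeddings are rows of `V`, builds $M = W W^\top$ so that it is positive semi-definite by construction, and applies the kernel to the (unsquared) generalized Mahalanobis distance as written in eq. (14.20); the function name is illustrative.

```python
import numpy as np

def mahalanobis_gaussian_similarity(V, W, sigma):
    """Gaussian kernel on the generalized Mahalanobis distance with M = W W^T."""
    M = W @ W.T                                        # symmetric positive semi-definite
    diff = V[:, None, :] - V[None, :, :]               # shape (n, n, d): v_i - v_j
    squared = np.einsum('ijk,kl,ijl->ij', diff, M, diff)
    dist = np.sqrt(np.maximum(squared, 0.0))           # clip tiny negatives from round-off
    return np.exp(-dist / (2.0 * sigma ** 2))

V = np.array([[0.0, 0.0],
              [3.0, 4.0]])
S_euclid = mahalanobis_gaussian_similarity(V, W=np.eye(2), sigma=1.0)        # M = I
S_scaled = mahalanobis_gaussian_similarity(V, W=2.0 * np.eye(2), sigma=1.0)  # M = 4I
```

With `W` equal to the identity the distance is Euclidean (5 for the pair above); scaling `W` by 2 doubles the distance and therefore lowers the similarity.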

Similarly, Henaff et al (2015) first computed the Euclidean distance between
any pair of node embeddings, and then applied a Gaussian kernel or a self-tuning
diffusion kernel (Zelnik-Manor and Perona, 2004), formulated as follows:

$$S(\vec{v}_i, \vec{v}_j) = \exp\left(\frac{-d(\vec{v}_i, \vec{v}_j)}{\sigma^2}\right), \qquad S_{local}(\vec{v}_i, \vec{v}_j) = \exp\left(\frac{-d(\vec{v}_i, \vec{v}_j)}{\sigma_i \sigma_j}\right) \tag{14.21}$$

where $S_{local}(\vec{v}_i, \vec{v}_j)$ defines a self-tuning diffusion kernel whose
variance is locally adapted around each node. Specifically, $\sigma_i$ is computed as
the distance $d(\vec{v}_i, \vec{v}_{i_k})$ corresponding to the $k$-th nearest neighbor
$i_k$ of node $i$.
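
The self-tuning kernel replaces the global width by a per-node scale. A minimal sketch, assuming node embeddings are rows of `V`, Euclidean distance, and no duplicate points (which would give a zero scale), is:

```python
import numpy as np

def self_tuning_similarity(V, k):
    """Kernel exp(-d_ij / (sigma_i * sigma_j)) with sigma_i the k-th neighbor distance."""
    dist = np.linalg.norm(V[:, None, :] - V[None, :, :], axis=-1)
    # After sorting each row, column 0 is the node itself (distance 0),
    # so column k is the distance to its k-th nearest neighbor.
    sigma = np.sort(dist, axis=1)[:, k]
    return np.exp(-dist / (sigma[:, None] * sigma[None, :]))

V = np.array([[0.0], [1.0], [3.0]])             # three points on a line
S_local = self_tuning_similarity(V, k=1)        # sigma = [1, 1, 2]
```

Nodes in sparse regions get a larger scale, so their similarities decay more slowly than those of nodes in dense regions.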


Structure-aware Similarity Metric Learning


When learning implicit graph structures from data, it might be beneficial to utilize 
the intrinsic graph structures as well if they are available. 


Utilizing Intrinsic Edge Embeddings for Similarity Metric Learning Inspired
by recent works on structure-aware transformers (Zhu et al, 2019b; Cai and Lam,
2020) which brought the intrinsic graph structure to the self-attention mechanism in
the transformer architecture, some works designed structure-aware similarity metric
functions which additionally consider the edge embeddings of the intrinsic graph.
Liu et al (2019b) introduced a structure-aware attention mechanism as follows:

$$S_{i,j}^{l} = \mathrm{softmax}\big(\vec{u}^\top \tanh(W [\vec{h}_i^{l}, \vec{h}_j^{l}, \vec{v}_i, \vec{v}_j, \vec{e}_{i,j}])\big) \tag{14.22}$$

where $\vec{v}_i$ denotes the node attributes for node $i$, $\vec{e}_{i,j}$ represents
the edge attributes between node $i$ and $j$, $\vec{h}_i^{l}$ is the vector
representation of node $i$ in the $l$-th GNN layer, and $\vec{u}$ and $W$ are a
trainable weight vector and weight matrix, respectively.

Similarly, Liu et al (2021) proposed a structure-aware global attention mechanism
for learning pair-wise node similarity, formulated as follows,

$$S_{i,j} = \frac{\mathrm{ReLU}(W^Q \vec{v}_i)^\top \big(\mathrm{ReLU}(W^K \vec{v}_j) + \mathrm{ReLU}(W^R \vec{e}_{i,j})\big)}{\sqrt{d}} \tag{14.23}$$

where $\vec{e}_{i,j} \in \mathbb{R}^{d_e}$ is the embedding of the edge connecting node
$i$ and $j$, $W^Q, W^K \in \mathbb{R}^{d \times d_n}$ and
$W^R \in \mathbb{R}^{d \times d_e}$ are learnable weight matrices, and $d$, $d_n$ and
$d_e$ are the dimensions of hidden vectors, node embeddings and edge embeddings,
respectively.


Utilizing Intrinsic Edge Connectivity Information for Similarity Metric Learning
In the case where only the edge connectivity information is available in the
intrinsic graph, Jiang et al (2019b) proposed a masked attention mechanism for graph
structure learning, formulated as follows,

$$S_{i,j} = \frac{A_{i,j} \exp\big(\mathrm{ReLU}(\vec{a}^\top |\vec{v}_i - \vec{v}_j|)\big)}{\sum_k A_{i,k} \exp\big(\mathrm{ReLU}(\vec{a}^\top |\vec{v}_i - \vec{v}_k|)\big)} \tag{14.24}$$

where $A_{i,j}$ is the adjacency matrix of the intrinsic graph and $\vec{a}$ is a
weight vector with the same dimension as node embeddings $\vec{v}_i$. This idea of
using masked attention to incorporate the initial graph topology shares the same
spirit with the GAT (Veličković et al, 2018) model.
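
The masking idea can be sketched as follows: attention scores are computed for every node pair, multiplied by the intrinsic adjacency matrix so that non-edges receive zero weight, and then normalized per row. The sketch assumes node embeddings are rows of `V`; returning an all-zero row for a node with no neighbors is a choice made here to avoid division by zero.

```python
import numpy as np

def masked_attention_similarity(V, A, a):
    """Attention restricted to existing edges, normalized over each node's neighbors."""
    diff = np.abs(V[:, None, :] - V[None, :, :])         # |v_i - v_j|, shape (n, n, d)
    scores = np.exp(np.maximum(diff @ a, 0.0))           # exp(ReLU(a^T |v_i - v_j|))
    masked = A * scores                                  # non-edges get zero weight
    denom = masked.sum(axis=1, keepdims=True)
    return np.divide(masked, denom, out=np.zeros_like(masked), where=denom > 0)

V = np.array([[0.0], [1.0], [2.0]])
A = np.array([[0.0, 1.0, 0.0],                           # path graph 0 - 1 - 2
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])
S = masked_attention_similarity(V, A, a=np.array([1.0]))
```

Because the mask is fixed, this variant can only re-weight edges that already exist in the intrinsic graph; it cannot add new ones.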


Graph Sparsification Techniques 


The aforementioned similarity metric learning functions all return a weighted
adjacency matrix associated with a fully-connected graph. A fully-connected graph is
not only computationally expensive but also might introduce noise such as unimportant
edges. In real-world applications, most graph structures are much more
sparse. Therefore, it can be beneficial to explicitly enforce sparsity on the learned
graph structure. Besides applying the ReLU function in the similarity metric functions
(Chen et al, 2020o; On et al, 2020; Yang et al, 2018c; Liu et al, 2021; Jiang
et al, 2019b), various graph sparsification techniques have been adopted to enhance
the sparsity of the learned graph structure.

Norcliffe-Brown et al (2018); Klicpera et al (2019b); Chen et al (2020o,n); Yu
et al (2021a) adopted a KNN-style sparsification operation to obtain a sparse
adjacency matrix from the node similarity matrix computed by the similarity metric
learning function, formulated as follows:

$$A_{i,:} = \mathrm{topk}(S_{i,:}) \tag{14.25}$$

where topk is a KNN-style operation. Specifically, for each node, only the K nearest
neighbors (including itself) and the associated similarity scores are kept, and the
remaining similarity scores are masked off.
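
A row-wise top-k operation of this kind can be sketched as below. The function name is illustrative, and ties between equal scores are broken arbitrarily by the partitioning routine.

```python
import numpy as np

def knn_sparsify(S, k):
    """Keep the k largest similarity scores in each row and zero out the rest."""
    n = S.shape[0]
    # argpartition on the negated scores places the k largest entries of each row first.
    idx = np.argpartition(-S, k - 1, axis=1)[:, :k]
    rows = np.arange(n)[:, None]
    A = np.zeros_like(S)
    A[rows, idx] = S[rows, idx]
    return A

S = np.array([[1.0, 0.2, 0.5],
              [0.2, 1.0, 0.9],
              [0.5, 0.9, 1.0]])
A_knn = knn_sparsify(S, k=2)
```

Note that the result is generally not symmetric even when `S` is, because node j being among the top-k neighbors of node i does not imply the converse.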

Klicpera et al (2019b); Chen et al (2020m) enforced a sparse adjacency matrix
by considering only the $\varepsilon$-neighborhood for each node, formulated as follows:

$$A_{i,j} = \begin{cases} S_{i,j} & S_{i,j} > \varepsilon \\ 0 & \text{otherwise} \end{cases} \tag{14.26}$$

where those elements in $S$ which are smaller than a non-negative threshold
$\varepsilon$ are all masked off (i.e., set to zero).
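
The thresholding rule in eq. (14.26) is a one-line operation; the sketch below uses an illustrative function name and a toy similarity matrix.

```python
import numpy as np

def epsilon_sparsify(S, eps):
    """Keep entries strictly greater than eps and set all others to zero."""
    return np.where(S > eps, S, 0.0)

S = np.array([[1.0, 0.2, 0.5],
              [0.2, 1.0, 0.9],
              [0.5, 0.9, 1.0]])
A_eps = epsilon_sparsify(S, eps=0.4)
```

Unlike the KNN-style operation, thresholding preserves the symmetry of `S` but lets the number of neighbors vary from node to node.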


Graph Regularization Techniques 


As discussed earlier, many works in the field of Graph Signal Processing typically 
learn the graph structure from data by directly optimizing the adjacency matrix to 
minimize the constraints defined based on certain graph properties, without con- 
sidering any downstream tasks. On the contrary, many works on graph structure 
learning for GNNs aim to optimize a similarity metric learning function (for learning
graph structures) toward the downstream prediction task. However, they do not
explicitly enforce the learned graph structure to have some common properties (e.g.,
smoothness) present in real-world graphs.

Chen et al (2020m) proposed to optimize the graph structures by minimizing a 
hybrid loss function combining both the task prediction loss and the graph
regularization loss. They explored three types of graph regularization losses which
impose constraints on the smoothness, connectivity and sparsity of the learned graph.


Smoothness The smoothness property assumes neighboring nodes to have similar
features.

$$\Omega(A, X) = \frac{1}{2n^2} \sum_{i,j} A_{i,j} \lVert \vec{x}_i - \vec{x}_j \rVert^2 = \frac{1}{n^2} \operatorname{tr}(X^\top L X) \tag{14.27}$$

where $\vec{x}_i$ is the feature vector of node $i$ (the $i$-th row of $X$ in this
expression), $\operatorname{tr}(\cdot)$ denotes the trace of a matrix, $L = D - A$ is
the graph Laplacian, and $D_{i,i} = \sum_j A_{i,j}$ is the degree matrix. As can be
seen, minimizing $\Omega(A, X)$ forces adjacent nodes to have similar features, thus
enforcing smoothness of the graph signals on the graph associated with $A$. However,
solely minimizing the smoothness loss will result in the trivial solution $A = 0$. We
might also want to impose other constraints on the graph.


Connectivity The following equation penalizes the formation of disconnected
graphs via the logarithmic barrier.

$$\frac{-1}{n} \mathbf{1}^\top \log(A \mathbf{1}) \tag{14.28}$$

where $n$ is the number of nodes.


Sparsity The following equation controls sparsity by penalizing large degrees.

$$\frac{1}{n^2} \lVert A \rVert_F^2 \tag{14.29}$$

where $\lVert \cdot \rVert_F$ denotes the Frobenius norm of a matrix.

In practice, solely minimizing one type of graph regularization loss might not
be desirable. For instance, solely minimizing the smoothness loss will result in the
trivial solution $A = 0$. Therefore, it could be beneficial to balance the trade-off
among different types of desired graph properties by computing a linear combination
of the various graph regularization losses, formulated as follows:

$$\frac{\alpha}{n^2} \operatorname{tr}(X^\top L X) - \frac{\beta}{n} \mathbf{1}^\top \log(A \mathbf{1}) + \frac{\gamma}{n^2} \lVert A \rVert_F^2 \tag{14.30}$$

where $\alpha$, $\beta$ and $\gamma$ are all non-negative hyperparameters for
controlling the smoothness, connectivity and sparsity of the learned graph.
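
The three regularizers and their weighted combination can be written down directly. In the sketch below, `X` holds one node feature vector per row (an assumption needed for the trace form to be well defined), `A` is assumed symmetric with strictly positive row sums, and the function names are illustrative. The pairwise form of the smoothness loss is included to check the identity in eq. (14.27) numerically.

```python
import numpy as np

def smoothness_loss(A, X):
    """Trace form tr(X^T L X) / n^2 with L = D - A."""
    n = A.shape[0]
    L = np.diag(A.sum(axis=1)) - A
    return float(np.trace(X.T @ L @ X) / n ** 2)

def smoothness_loss_pairwise(A, X):
    """Pairwise form: sum_ij A_ij ||x_i - x_j||^2 / (2 n^2)."""
    n = A.shape[0]
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return float(np.sum(A * sq) / (2 * n ** 2))

def connectivity_loss(A):
    """Log barrier on node degrees; tends to +infinity as any degree approaches 0."""
    n = A.shape[0]
    return float(-np.sum(np.log(A.sum(axis=1))) / n)

def sparsity_loss(A):
    """Squared Frobenius norm scaled by 1 / n^2."""
    return float(np.sum(A ** 2) / A.shape[0] ** 2)

def graph_regularization(A, X, alpha, beta, gamma):
    return (alpha * smoothness_loss(A, X)
            + beta * connectivity_loss(A)
            + gamma * sparsity_loss(A))

A = np.array([[0.0, 2.0],
              [2.0, 0.0]])
X = np.array([[0.0],
              [2.0]])
total = graph_regularization(A, X, alpha=1.0, beta=1.0, gamma=1.0)
```

Shrinking the edge weights lowers the smoothness and sparsity terms but raises the connectivity term, which is what prevents the trivial solution $A = 0$.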

Besides the above graph regularization techniques, other prior assumptions have been
adopted in the literature, such as that neighboring nodes tend to share the same label
(Yang et al, 2019c) and that the learned implicit adjacency matrix should be close to
the intrinsic adjacency matrix (Jiang et al, 2019b).


Combining Intrinsic Graph Structures and Implicit Graph Structures 


Recall that one of the most important motivations for graph structure learning is 
that the intrinsic graph structure (if it is available) might be error-prone (e.g., noisy 
or incomplete) and sub-optimal for the downstream prediction task. However, the 
intrinsic graph typically still carries rich and useful information regarding the opti- 
mal graph structure for the downstream task. Hence, it could be harmful to totally 
discard the intrinsic graph structure. 

A few recent works (Li et al, 2018c; Chen et al, 2020m; Liu et al, 2021) proposed 
to combine the learned implicit graph structure with the intrinsic graph structure for 
better downstream prediction performance. The rationales are as follows. First of
all, they assume that the optimized graph structure is potentially a “shift” (e.g.,
substructures) from the intrinsic graph structure, and the similarity metric function
is intended to learn such a “shift” which is supplementary to the intrinsic graph
structure. Secondly, incorporating the intrinsic graph structure can help accelerate
the training process and increase the training stability: there is no prior knowledge
on the similarity metric and the trainable parameters are randomly initialized, so
training without it may take a long time to converge.

Different ways of combining intrinsic and implicit graph structures have been
proposed. For instance, Li et al (2018c) and Chen et al (2020m) proposed to compute a
linear combination of the normalized graph Laplacian of the intrinsic graph structure
and the normalized adjacency matrix of the implicit graph structure, formulated as
follows:

$$\tilde{A} = \lambda L^{(0)} + (1 - \lambda) f(A) \qquad (14.31)$$

where $L^{(0)}$ is the normalized graph Laplacian matrix, $f(A)$ is the normalized adjacency
matrix associated to the learned implicit graph structure, and $\lambda$ is a hyperparameter
controlling the trade-off between the intrinsic and implicit graph structures.
Note that $f : \mathbb{R}^{n \times n} \to \mathbb{R}^{n \times n}$ can be an arbitrary normalization operation such as the graph
Laplacian operation or the row-normalization operation. Liu et al (2021) proposed a
hybrid message passing mechanism for GNNs which fuses the two aggregated node
vectors from the intrinsic graph and the learned implicit graph, respectively, and
then feeds the fused vector to a GRU (Cho et al, 2014a) to update node embeddings.
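The linear combination in equation 14.31 can be illustrated with a minimal NumPy sketch. It rests on assumptions that are not stated in the text: $L^{(0)}$ is taken to be the symmetrically normalized intrinsic adjacency matrix, $f$ is taken to be row normalization, and the function names (`sym_normalize`, `row_normalize`, `combine_graphs`) and the toy matrices are purely illustrative.

```python
import numpy as np

def sym_normalize(adj):
    """Symmetric normalization D^{-1/2} A D^{-1/2}; isolated nodes keep zero rows."""
    deg = adj.sum(axis=1)
    inv_sqrt = np.zeros_like(deg, dtype=float)
    nonzero = deg > 0
    inv_sqrt[nonzero] = deg[nonzero] ** -0.5
    return adj * inv_sqrt[:, None] * inv_sqrt[None, :]

def row_normalize(adj):
    """Row normalization so that every non-empty row sums to one."""
    row_sum = adj.sum(axis=1, keepdims=True)
    return adj / np.maximum(row_sum, 1e-12)

def combine_graphs(intrinsic_adj, learned_adj, lam=0.8):
    """Sketch of equation 14.31: lam * L0 + (1 - lam) * f(A)."""
    l0 = sym_normalize(intrinsic_adj)   # assumed stand-in for L^(0)
    f_a = row_normalize(learned_adj)    # one possible choice of f
    return lam * l0 + (1.0 - lam) * f_a

# Intrinsic path graph 0-1-2 and a (hypothetical) learned similarity graph.
intrinsic = np.array([[0.0, 1.0, 0.0],
                      [1.0, 0.0, 1.0],
                      [0.0, 1.0, 0.0]])
learned = np.array([[1.0, 0.2, 0.8],
                    [0.2, 1.0, 0.1],
                    [0.8, 0.1, 1.0]])
combined = combine_graphs(intrinsic, learned, lam=0.8)
```

In this toy example nodes 0 and 2 are not connected in the intrinsic graph, yet `combined[0, 2]` is non-zero because the learned graph contributes a supplementary edge; setting `lam=1.0` recovers the normalized intrinsic graph alone.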


Learning Paradigms 


Most existing methods for graph structure learning for GNNs consist of two key
learning components: graph structure learning (i.e., similarity metric learning) and
graph representation learning (i.e., the GNN module). The ultimate goal is to learn
optimized graph structures and representations with respect to a certain downstream
prediction task. How to optimize the two separate learning components toward
the same ultimate goal thus becomes an important question.


Joint Learning of Graph Structures and Representations 


The most straightforward strategy is to jointly optimize the whole learning system
in an end-to-end manner toward the downstream prediction task, which provides
a certain form of supervision, as illustrated in fig. 14.2. Jiang et al (2019b); Yang et al
(2019c); Chen et al (2020m) designed a hybrid loss function combining both the task
prediction loss and the graph regularization loss, namely, $\mathcal{L} = \mathcal{L}_{pred} + \mathcal{L}_{\mathcal{G}}$. The aim
of introducing the graph regularization loss is to bring in prior knowledge about
graph properties (e.g., smoothness, sparsity), as discussed above, so as to encourage
learning more meaningful graph structures and alleviate the potential overfitting
issue.
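As a rough illustration of such a hybrid loss, the sketch below adds a graph regularizer to a cross-entropy prediction loss. The specific regularizer is an assumption made for the example: a Dirichlet-energy smoothness term plus a squared Frobenius-norm penalty on edge weights standing in for the sparsity prior, with hypothetical weights `alpha` and `beta`. The cited works may use different terms.

```python
import numpy as np

def cross_entropy(logits, labels):
    """Mean negative log-likelihood of the true class under a softmax."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def graph_regularizer(adj, feats, alpha=0.1, beta=0.1):
    """Assumed regularizer: feature smoothness plus an edge-weight penalty."""
    n = adj.shape[0]
    lap = np.diag(adj.sum(axis=1)) - adj                 # unnormalized Laplacian
    smoothness = np.trace(feats.T @ lap @ feats) / n**2  # small if neighbors are similar
    edge_penalty = (adj ** 2).sum() / n**2               # discourages large/dense weights
    return alpha * smoothness + beta * edge_penalty

def hybrid_loss(logits, labels, adj, feats, alpha=0.1, beta=0.1):
    """L = L_pred + L_G, with L_G given by the assumed regularizer above."""
    return cross_entropy(logits, labels) + graph_regularizer(adj, feats, alpha, beta)

# Toy inputs: uninformative logits, constant features, a 3-node path graph.
logits = np.zeros((3, 2))
labels = np.array([0, 1, 0])
feats = np.ones((3, 4))
path = np.array([[0.0, 1.0, 0.0],
                 [1.0, 0.0, 1.0],
                 [0.0, 1.0, 0.0]])
loss = hybrid_loss(logits, labels, path, feats)
```

With constant features the smoothness term vanishes, so only the edge-weight penalty is added to the prediction loss in this toy case.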

Fig. 14.2: Joint learning paradigm. (Diagram: node features and an optional initial
graph structure yield a learned graph structure, which a GNN uses for the downstream
task prediction.)


Adaptive Learning of Graph Structures and Representations 


Fig. 14.3: Adaptive learning paradigm. (Diagram: node features and an optional initial
graph structure; learned graph structure 1, node embeddings 1, ..., learned graph
structure N; repeated for a fixed number of stacked GNN layers; downstream task
prediction.)


It is common practice to sequentially stack multiple GNN layers so as to cap- 
ture long-range dependencies in a graph. As a result, the graph representations up- 
dated by one GNN layer will be consumed by the next GNN layer as the initial 
graph representations. Since the input graph representations to each GNN layer are
transformed by the previous GNN layer, one may naturally ask whether the input
graph structure to each GNN layer should be adaptively adjusted to reflect the
changes in the graph representations, as illustrated in fig. 14.3. One such example
is the GAT (Veličković et al, 2018) model, which adaptively reweights the importance
of neighboring node embeddings by applying the self-attention mechanism to
the previously updated node embeddings when performing neighborhood aggregation
at each GAT layer. However, the GAT model does not update the connectivity
information of the intrinsic graph. In the literature on graph structure learning for
GNNs, some methods (Yang et al, 2018c; Liu et al, 2019b; Huang et al, 2020a;
Saire and Ramirez Rivera, 2019) also operate by adaptively learning a graph structure
for every GNN layer based on the updated graph representations produced by
the previous GNN layer. The whole learning system is usually jointly optimized
in an end-to-end manner toward the downstream prediction task.
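The adaptive paradigm can be sketched as a forward pass in which each layer first re-learns a graph from its input embeddings and then aggregates over that graph. This is only an illustration under assumptions: cosine similarity is used as the similarity metric, the layer is a simple ReLU(A Z W) update, and all names and sizes are hypothetical rather than taken from any of the cited methods.

```python
import numpy as np

def cosine_graph(emb):
    """Weighted adjacency from pairwise cosine similarity, negatives clipped to zero."""
    norm = np.linalg.norm(emb, axis=1, keepdims=True)
    unit = emb / np.maximum(norm, 1e-12)
    return np.maximum(unit @ unit.T, 0.0)

def row_normalize(adj):
    """Row normalization so that every non-empty row sums to one."""
    return adj / np.maximum(adj.sum(axis=1, keepdims=True), 1e-12)

def adaptive_forward(feats, weights):
    """Stack GNN layers, re-learning the graph from each layer's input embeddings."""
    emb, graphs = feats, []
    for w in weights:
        adj = row_normalize(cosine_graph(emb))   # structure used by this layer only
        graphs.append(adj)
        emb = np.maximum(adj @ emb @ w, 0.0)     # ReLU(A Z W)
    return emb, graphs

rng = np.random.default_rng(0)
feats = rng.random((5, 4))                          # 5 nodes, 4 raw features
weights = [rng.random((4, 8)), rng.random((8, 3))]  # two stacked layers
emb, graphs = adaptive_forward(feats, weights)
```

The returned list `graphs` holds one learned adjacency matrix per layer, which makes it easy to inspect how the structure changes as the embeddings are transformed.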


Iterative Learning of Graph Structures and Representations 


Fig. 14.4: Iterative learning paradigm. (Diagram: node features and an optional initial
graph structure; learned graph structure and node embeddings are updated in a loop
that repeats until a condition is satisfied; downstream task prediction.)


Both the aforementioned joint learning and adaptive learning paradigms aim to
learn a graph structure by applying a similarity metric function to the graph
representations in a one-shot effort. Even though the adaptive learning paradigm
learns a graph structure at each GNN layer based on the updated graph representations,
the graph structure learning procedure at each GNN layer is still one-shot.
One major limitation of such a one-shot graph structure learning paradigm is that the
quality of the learned graph structure heavily relies on the quality of the graph
representations. Most existing methods assume that raw node features capture a good
amount of information about the graph topology, which unfortunately is not always 
the case. Thus, it can be challenging to learn good implicit graph structures from 
the raw node features which do not contain adequate information about the graph 
topology. 

Chen et al (2020m) proposed a novel end-to-end graph learning framework,
dubbed IDGL, for jointly and iteratively learning graph structures and representations.
As illustrated in fig. 14.4, the IDGL framework operates by learning a better
graph structure based on better graph representations and, in the meantime, learning
better graph representations based on a better graph structure in an iterative manner.
More specifically, the IDGL framework iteratively searches for an implicit graph
structure that augments the intrinsic graph structure (if not available, a KNN graph
is used) and is optimized for the downstream prediction task. This iterative
learning procedure dynamically stops when the learned graph structure is
close enough to the optimized graph (with respect to the downstream task) according
to a certain stopping criterion (i.e., the difference between the learned adjacency matrices
at consecutive iterations is smaller than a certain threshold). At each iteration, a hybrid
loss combining both the task prediction loss and the graph regularization loss
is added to the overall loss. After all iterations, the overall loss is back-propagated
through all previous iterations to update model parameters.
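The iterate-until-stable control flow can be sketched as follows. This is a forward-only illustration, not the IDGL implementation: it omits the per-iteration loss and back-propagation, assumes cosine similarity as the metric, mixes the learned graph with the initial graph using a hypothetical weight `lam`, and uses a relative squared Frobenius change below `tol` as the stopping criterion.

```python
import numpy as np

def row_normalize(adj):
    return adj / np.maximum(adj.sum(axis=1, keepdims=True), 1e-12)

def cosine_graph(emb):
    unit = emb / np.maximum(np.linalg.norm(emb, axis=1, keepdims=True), 1e-12)
    return np.maximum(unit @ unit.T, 0.0)

def iterative_refinement(feats, init_adj, weight, lam=0.5, tol=1e-3, max_iter=10):
    """Alternate between updating embeddings and re-learning the graph."""
    base = row_normalize(init_adj)
    adj, n_iter = base, 0
    emb = np.tanh(adj @ feats @ weight)
    for n_iter in range(1, max_iter + 1):
        learned = row_normalize(cosine_graph(emb))       # graph from current embeddings
        new_adj = lam * base + (1.0 - lam) * learned     # augment the initial graph
        change = np.linalg.norm(new_adj - adj) ** 2 / np.linalg.norm(adj) ** 2
        adj = new_adj
        emb = np.tanh(adj @ feats @ weight)              # embeddings from the new graph
        if change < tol:                                 # assumed stopping criterion
            break
    return adj, emb, n_iter

rng = np.random.default_rng(0)
n = 6
ring = np.roll(np.eye(n), 1, axis=1) + np.roll(np.eye(n), -1, axis=1) + np.eye(n)
feats = rng.normal(size=(n, 4))
weight = rng.normal(size=(4, 3))
refined_adj, refined_emb, n_iter = iterative_refinement(feats, ring, weight)
```

In a trainable version, each pass through the loop would also contribute a loss term, and gradients would flow through all iterations as described above.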

This iterative learning paradigm for repeatedly refining the graph structure and 
graph representations has a few advantages. On the one hand, even when the raw 
node features do not contain adequate information for learning implicit relation- 
ships among nodes, the node embeddings learned by the graph representation learn- 
ing component could ideally provide useful information for learning a better graph 
structure because these node embeddings are optimized toward the downstream 
task. On the other hand, the newly learned graph structure could be a better graph 
input for the graph representation learning component to learn better node embed- 
dings. 


14.3.2 Connections to Other Problems 


Graph structure learning for GNNs has interesting connections to a few important 
problems. Thinking about these connections might spur further research in those 
areas. 


14.3.2.1 Graph Structure Learning as Graph Generation 


The task of graph generation focuses on generating realistic and meaningful graphs. 
The early works of graph generation formalized the problem as a stochastic gen- 
eration process, and proposed various random graph models for generating a pre- 
selected family of graphs such as ER graphs (Erdős and Rényi, 1959), small-world 
networks (Watts and Strogatz, 1998), and scale-free graphs (Albert and Barabasi, 
2002). However, these approaches typically make certain simplified and carefully
designed a priori assumptions on graph properties, and thus in general have limited
modeling capacity on complex graph structures. Recent attempts focus on building
deep generative models for graphs by leveraging RNNs (You et al, 2018b), VAEs (Jin
et al, 2018a), GANs (Wang et al, 2018a), flow-based techniques (Shi et al, 2019a) and
other specially designed models (You et al, 2018a). GNNs are usually adopted
by these models as a powerful graph encoder.

Even though the graph generation task and the graph structure learning task
both focus on learning graphs from data, they have essentially different goals and
methodologies. Firstly, the graph generation task aims to generate new graphs where
both nodes and edges are added together to construct a meaningful graph, whereas
the graph structure learning task aims to learn a graph structure given a set of node
attributes. Secondly, generative models for graphs typically operate by learning the
distribution from the observed set of graphs, and generating more realistic graphs
by sampling from the learned graph distribution. In contrast, graph structure learning
methods typically operate by learning the pair-wise relationships among the given set
of nodes, and based on that, building the graph topology. It will be an interesting
research direction to study how the two tasks can help each other.


14.3.2.2 Graph Structure Learning for Graph Adversarial Defenses 


Recent studies (Dai et al, 2018a; Zügner et al, 2018) have shown that GNNs are
vulnerable to carefully crafted perturbations (a.k.a. adversarial attacks), e.g., small
deliberate perturbations in graph structures and node/edge attributes. Researchers
working on building robust GNNs found graph structure learning a powerful tool 
against topology attacks. Given an initial graph whose topology might become un- 
reliable because of adversarial attacks, they leveraged graph structure learning tech- 
niques to recover the intrinsic graph topology from the poisoned graph. 

For instance, assuming that adversarial attacks are likely to violate some intrinsic 
graph properties (e.g., low-rank and sparsity), Jin et al (2020e) proposed to jointly 
learn the GNN model and the “clean” graph structure from the perturbed graph 
by optimizing some hybrid loss combining both the task prediction loss and the 
graph regularization loss. In order to restore the structure of the perturbed graph, 
Zhang and Zitnik (2020) designed a message-passing scheme that can detect fake 
edges, block them and then attend to true, unperturbed edges. In order to address 
the noise brought by the task-irrelevant information on real-life large graphs, Zheng 
et al (2020b) introduced a supervised graph sparsification technique to remove po- 
tentially task-irrelevant edges from input graphs. Chen et al (2020d) proposed a 
Label-Aware GCN (LAGCN) framework which can refine the graph structure (i.e., 
filtering distracting neighbors and adding valuable neighbors for each node) before 
the training of GCN. 

There are many connections between graph adversarial defenses and graph structure
learning. On the one hand, graph structure learning is partially motivated by
improving potentially error-prone (e.g., noisy or incomplete) input graphs for GNNs,
which shares a similar spirit with graph adversarial defenses. On the other hand,
the task of graph adversarial defenses can benefit from graph structure learning
techniques, as evidenced by some recent works.

However, there is a key difference between their problem settings. The graph
adversarial defenses task deals with the setting where the initial graph structure is
available, but potentially poisoned by adversarial attacks. In contrast, the graph structure
learning task aims to handle both scenarios, where the input graph structure is
available or unavailable. Even when the input graph structure is available, one can
still improve it by “denoising” the graph structure or augmenting the graph structure
with an implicit graph structure which captures implicit relationships among nodes.


14.3.2.3 Understanding Transformers from a Graph Learning Perspective


Transformer models (Vaswani et al, 2017) have been widely used as a powerful
alternative to Recurrent Neural Networks, especially in the Natural Language Processing
field. Recent studies (Choi et al, 2020) have shown the close connection between
transformer models and GNNs. By nature, transformer models aim to learn
a self-attention matrix between every pair of objects, which can be thought of as an
adjacency matrix associated with a fully-connected graph containing each object as
a node. Therefore, one can claim that transformer models also perform some sort
of joint graph structure and representation learning, even though these models typi- 
cally do not consider any initial graph topology and do not control the quality of the 
learned fully-connected graph. Recently, many variants of the so-called graph trans- 
formers (Zhu et al, 2019b; Yao et al, 2020; Koncel-Kedziorski et al, 2019; Wang 
et al, 2020k; Cai and Lam, 2020) have been developed to combine the benefits of 
both GNNs and transformers. 
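To make the graph view of self-attention concrete, the sketch below computes a scaled dot-product attention matrix and treats it as a weighted adjacency matrix. The projection matrices, sizes, and the function name `self_attention_graph` are illustrative assumptions.

```python
import numpy as np

def self_attention_graph(x, w_q, w_k):
    """Row-stochastic attention matrix, read as a dense weighted adjacency matrix."""
    q, k = x @ w_q, x @ w_k
    scores = q @ k.T / np.sqrt(q.shape[1])
    scores = scores - scores.max(axis=1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
tokens = rng.normal(size=(5, 8))                  # 5 objects (nodes), 8 features each
w_q = rng.normal(scale=0.3, size=(8, 4))
w_k = rng.normal(scale=0.3, size=(8, 4))
attn_adj = self_attention_graph(tokens, w_q, w_k)
```

Every entry of `attn_adj` is strictly positive, so the implied graph is fully connected; nothing in this computation prunes weak edges or consults an initial topology, which mirrors the caveat in the paragraph above.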


14.4 Future Directions 


In this section, we will introduce some advanced topics of graph structure learning 
for GNNs and highlight some promising future directions. 


14.4.1 Robust Graph Structure Learning 


Although one of the major motivations for developing graph structure learning
techniques for GNNs is to handle noisy or incomplete input graphs, robustness does not
lie at the heart of most existing graph structure learning techniques. Most existing
works did not evaluate the robustness of their approaches to noisy initial graphs.
Recent works showed that random edge addition or deletion attacks significantly
degraded the downstream task performance (Franceschi et al, 2019; Chen et al,
2020m). Moreover, most existing works admit that the initial graph structure (if
provided) might be noisy and thus unreliable for graph representation learning, but
they still assume that node features are reliable for graph structure learning, which
is often not true in real-world scenarios. Therefore, it is challenging yet rewarding to
explore robust graph structure learning techniques for data with noisy initial graph
structures and noisy node attributes.


14.4.2 Scalable Graph Structure Learning


Most existing graph structure learning techniques need to model the pair-wise
relationships among all the nodes in order to discover the hidden graph structure.
Therefore, their time complexity is at least $O(n^2)$ where $n$ is the number of graph
nodes. This can be very expensive and even intractable for large-scale graphs (e.g.,
social networks) in the real world. Recently, Chen et al (2020m) proposed a scalable
graph structure learning approach by leveraging the anchor-based approximation 
technique to avoid explicitly computing the pair-wise node similarity, and achieved 
linear complexity in both computational time and memory consumption with respect 
to the number of graph nodes. In order to improve the scalability of transformer 
models, different kinds of approximation techniques have also been developed in 
recent works (Tsai et al, 2019; Katharopoulos et al, 2020; Choromanski et al, 2021; 
Peng et al, 2021; Shen et al, 2021; Wang et al, 2020g). Considering the close connec- 
tions between graph structure learning for GNNs and transformers, we believe there 
are many opportunities in building scalable graph structure learning techniques for 
GNNs. 
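The idea behind anchor-based approximation can be sketched as two-step message passing through a small set of $s$ anchors, so that the $n \times n$ adjacency matrix is never formed. The factorization used here, $A = \Delta^{-1} R \Lambda^{-1} R^\top$ with a node-anchor affinity matrix $R \in \mathbb{R}^{n \times s}$, is a common anchor-graph construction and is an assumption of this example; the cosine affinity, random anchor selection, and all names are illustrative.

```python
import numpy as np

def node_anchor_affinity(feats, anchors):
    """Non-negative cosine similarity between every node and every anchor (n x s)."""
    f = feats / np.maximum(np.linalg.norm(feats, axis=1, keepdims=True), 1e-12)
    a = anchors / np.maximum(np.linalg.norm(anchors, axis=1, keepdims=True), 1e-12)
    return np.maximum(f @ a.T, 0.0)

def anchor_propagate(r, emb):
    """Apply A = Delta^{-1} R Lambda^{-1} R^T to emb in O(n s d) time and memory."""
    node_deg = np.maximum(r.sum(axis=1, keepdims=True), 1e-12)   # Delta, n x 1
    anchor_deg = np.maximum(r.sum(axis=0)[:, None], 1e-12)       # Lambda, s x 1
    anchor_msg = (r.T @ emb) / anchor_deg                        # node -> anchor, s x d
    return (r @ anchor_msg) / node_deg                           # anchor -> node, n x d

rng = np.random.default_rng(0)
feats = rng.random((200, 16))                                    # n = 200 nodes
anchors = feats[rng.choice(200, size=10, replace=False)]         # s = 10 anchors
r = node_anchor_affinity(feats, anchors)
out = anchor_propagate(r, feats)
```

For a fixed number of anchors, both the time and the memory of `anchor_propagate` grow linearly with the number of nodes, in contrast to the quadratic cost of materializing all pair-wise similarities.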


14.4.3 Graph Structure Learning for Heterogeneous Graphs 


Most existing graph structure learning works focus on learning homogeneous graph
structures from data. In comparison with homogeneous graphs, heterogeneous
graphs carry richer information on node types and edge types, and
occur frequently in real-world graph-related applications. Graph structure learning
for heterogeneous graphs is expected to be more challenging because more types
of information (e.g., node types, edge types) need to be learned from data.
Some recent attempts (Yun et al, 2019; Zhao et al, 2021) have been made to learn 
graph structures from heterogeneous graphs. 


14.5 Summary 


In this chapter, we explored and discussed graph structure learning from multiple 
perspectives. We first reviewed the existing works on graph structure learning in the 
literature of traditional machine learning, including both unsupervised graph struc- 
ture learning and supervised graph structure learning. As for unsupervised graph 
structure learning, we mainly looked into some representative techniques devel- 
oped from the Graph Signal Processing community. We also introduced some recent 
works on clustering analysis that leveraged graph structure learning techniques. As 
for supervised graph structure learning, we introduced how this problem was studied 
in the research on modeling interacting systems and Bayesian Networks. The main 
focus of this chapter is on introducing recent advances in graph structure learning
for GNNs. We motivated graph structure learning in the GNN field by discussing the
scenarios where the graph-structured data is noisy or unavailable. We then moved 
on to introduce recent research progress in joint graph structure and representa- 
tion learning, including learning discrete graph structures and learning weighted 
graph structures. The connections and differences between graph structure learning 
and other important problems such as graph generation, graph adversarial defenses 
and transformer models were also discussed. We then highlighted several remain- 
ing challenges and future directions in the research of graph structure learning for 
GNNs. 


Editor’s Notes: Graph structure learning is a fast-emerging research topic
that has attracted significant interest in recent years. The key
idea is to learn an optimized graph structure in order to generate better
node representations (Chapter 4) and more robust node representations
(Chapter 8). Graph structure learning can be expensive if
the common pair-wise learning approach is adopted, and thus scalability
can be a major concern (Chapter 6). Meanwhile, it is closely
connected to graph generation (Chapter 11) and self-supervised learning
(Chapter 18), since they all partially consider how to modify or leverage
graph structure. This chapter is applicable to a broad range of application
domains such as recommender systems (Chapter 19), computer
vision (Chapter 20), Natural Language Processing (Chapter 21), Program
Analysis (Chapter 22), and so on.


Abstract The world around us is composed of entities that interact and form relations
with each other. This makes graphs an essential data representation and a
crucial building-block for machine learning applications; the nodes of the graph 
correspond to entities and the edges correspond to interactions and relations. The 
entities and relations may evolve; e.g., new entities may appear, entity properties 
may change, and new relations may be formed between two entities. This gives rise 
to dynamic graphs. In applications where dynamic graphs arise, there often exists 
important information within the evolution of the graph, and modeling and exploit- 
ing such information is crucial in achieving high predictive performance. In this 
chapter, we characterize various categories of dynamic graph modeling problems. 
Then we describe some of the prominent extensions of graph neural networks to dy- 
namic graphs that have been proposed in the literature. We conclude by reviewing 
three notable applications of dynamic graph neural networks namely skeleton-based 
human activity recognition, traffic forecasting, and temporal knowledge graph com- 
pletion. 


15.1 Introduction 


Traditionally, machine learning models were developed to make predictions about 
entities (or objects or examples) given only their features and irrespective of their 
connections with the other entities in the data. Examples of such prediction tasks 
include predicting the political party a social network user supports given their other 
features, predicting the topic of a publication given its text, predicting the type of 
the object in an image given the image pixels, and predicting the traffic in a road (or 
road segment) given historical traffic data in that road.

In many applications, there exist relationships between the entities that can be
exploited to make better predictions about them. As a few examples, social network 
users that are close friends or family members are more likely to support the same 
political party, two publications by the same author are more likely to have the same 
topic, two images taken from the same website (or uploaded to social media by 
the same user) are more likely to have similar objects in them, and two roads that 
are connected are more likely to have similar traffic volumes. The data for these 
applications can be represented in the form of a graph where nodes correspond to 
entities and edges correspond to the relationships between these entities. 

Graphs arise naturally in many real-world applications including recommender 
systems, biology, social networks, ontologies, knowledge graphs, and computational 
finance. In some domains the graph is static, i.e. the graph structure and the node fea- 
tures are fixed over time. In other domains, the graph changes over time. In a social 
network, for example, new edges are added when people make new friends, exist- 
ing edges are removed when people stop being friends, and node features change 
as people change their attributes, e.g., when they change their career assuming that 
career is one of the node features. In this chapter, we focus on the domains where 
the graph is dynamic and changes over time. 

In applications where dynamic graphs arise, modeling the evolution of the graph 
is often crucial in making accurate predictions. Over the years, several classes of 
machine learning models have been developed that capture the structure and the 
evolution of dynamic graphs. Among these classes, extensions of graph neural net- 
works (GNNs) (Scarselli et al, 2008; Kipf and Welling, 2017b) to dynamic graphs 
have recently found success in several domains and they have become one of the 
essential tools in the machine learning toolbox. In this chapter, we review the GNN 
approaches for dynamic graphs and provide several application domains where dy- 
namic GNNs have provided striking results. The chapter is not meant to be a full 
survey of the literature but rather a description of the common techniques for apply- 
ing GNNs to dynamic graphs. For a comprehensive survey of representation learn- 
ing approaches for dynamic graphs we refer the reader to (Kazemi et al, 2020), and 
for a more specialized survey of GNN-based approaches to dynamic graphs we refer 
the reader to (Skarding et al, 2020). 

The rest of the chapter is organized as follows. In Section 15.2, we define the no- 
tation that will be used throughout the chapter and provide the necessary background 
to follow the rest of the chapter. In Section 15.3, we describe different types of dy- 
namic graphs and different prediction problems on these graphs. In Section 15.4, we 
review several approaches for applying GNNs on dynamic graphs. In Section 15.5, 
we review some of the applications of dynamic GNNs. Finally, Section 15.6 summarizes
and concludes the chapter.


15.2 Background and Notation


In this section, we define our notation and provide the background required to follow 
the rest of the chapter. 

We use lowercase letters $z$ to denote scalars, bold lowercase letters $\mathbf{z}$ to denote
vectors and uppercase letters $Z$ to denote matrices. $z_i$ denotes the $i$th element of $\mathbf{z}$,
$Z_i$ denotes a column vector corresponding to the $i$th row of $Z$, and $Z_{i,j}$ denotes the
element at the $i$th row and $j$th column of $Z$. $\mathbf{z}^\top$ denotes the transpose of $\mathbf{z}$ and $Z^\top$ denotes
the transpose of $Z$. $(\mathbf{z} \,\|\, \mathbf{z}') \in \mathbb{R}^{d+d'}$ corresponds to the concatenation of $\mathbf{z} \in \mathbb{R}^{d}$ and
$\mathbf{z}' \in \mathbb{R}^{d'}$. We use $I$ to represent an identity matrix. We use $\odot$ to denote the element-wise
(Hadamard) product. We represent a sequence as $[e_1, e_2, \dots, e_k]$ and a set as
$\{e_1, e_2, \dots, e_k\}$ where the $e_i$s represent the elements in the sequence or set.

In this chapter, we mainly consider attributed graphs. We represent an attributed
graph as $G = (V, A, X)$ where $V = \{v_1, v_2, \dots, v_n\}$ is the set of vertices (aka nodes),
$n = |V|$ denotes the number of nodes, $A \in \mathbb{R}^{n \times n}$ is an adjacency matrix, and
$X \in \mathbb{R}^{n \times d}$ is a feature matrix where $X_i$ represents the features associated with the $i$th node
$v_i$ and $d$ denotes the number of features. If there exists no edge between $v_i$ and $v_j$,
then $A_{i,j} = 0$; otherwise, $A_{i,j} \in \mathbb{R}^{+}$ represents the weight of the edge where $\mathbb{R}^{+}$
represents positive real numbers.

If $G$ is unweighted, then the range of $A$ is $\{0,1\}$ (i.e. $A \in \{0,1\}^{n \times n}$). $G$ is undirected
if the edges have no directions; it is directed if the edges have directions.
For an undirected graph, $A$ is symmetric (i.e. $A = A^\top$). For each edge $A_{i,j} > 0$ of
a directed graph, we call $v_i$ the source and $v_j$ the target of the edge. If $G$ is multi-relational
with a set $R = \{r_1, \dots, r_m\}$ of relations, then the graph has $m$ adjacency
matrices where the $i$th adjacency matrix represents the existence of the $i$th relation $r_i$
between the nodes.
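As a small illustration of this notation, the sketch below builds the matrices $A$ and $X$ for a toy weighted undirected graph and a stack of adjacency matrices for a toy multi-relational graph. The edge list, sizes, and variable names are made up for the example.

```python
import numpy as np

n, d = 4, 3                                   # number of nodes and features

# Weighted, undirected graph: A[i, j] > 0 is the weight of the edge (v_i, v_j).
A = np.zeros((n, n))
for i, j, w in [(0, 1, 1.0), (1, 2, 0.5), (2, 3, 2.0)]:
    A[i, j] = A[j, i] = w                     # symmetric entries: no edge direction

X = np.arange(n * d, dtype=float).reshape(n, d)   # row i holds the features of v_i

is_undirected = np.array_equal(A, A.T)            # A equals its transpose
is_unweighted = bool(np.isin(A, (0.0, 1.0)).all())  # only 0/1 entries allowed

# Multi-relational graph with m = 2 relations: one adjacency matrix per relation.
relations = np.zeros((2, n, n))
relations[0, 0, 1] = 1.0                      # relation r_1 holds from v_0 to v_1
relations[1, 2, 3] = 1.0                      # relation r_2 holds from v_2 to v_3
```

Here `A` is symmetric but not binary, so the toy graph is undirected and weighted.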


15.2.1 Graph Neural Networks 


In this chapter, we use the term Graph Neural Network (GNN) to refer to the general 
class of neural networks that operate on graphs through message-passing between 
the nodes. Here, we provide a brief description of GNNs. 

Let $G = (V, A, X)$ be a static attributed graph. A GNN is a function
$f : \mathbb{R}^{n \times n} \times \mathbb{R}^{n \times d} \to \mathbb{R}^{n \times d'}$ that takes $G$ (or more specifically $A$ and $X$) as input and provides as
output a matrix $Z \in \mathbb{R}^{n \times d'}$ where $Z_i \in \mathbb{R}^{d'}$ corresponds to a hidden representation
for the $i$th node $v_i$. This hidden representation is called the node embedding. Providing
a node embedding for each node $v_i$ can be viewed as dimensionality reduction
where the information from $v_i$'s initial features as well as the information from its
connectivity to other nodes and the features of these nodes are captured in a vector
$Z_i$. This vector can be used to make informed predictions about $v_i$. In what follows,
we describe two example GNNs, namely graph convolutional networks and graph
attention networks, for undirected graphs.

Graph Convolutional Networks: Graph convolutional networks (GCNs) (Kipf
and Welling, 2017b) stack multiple layers of graph convolution. The $l$th layer of a GCN
for an undirected graph $G = (V, A, X)$ can be formulated as follows:

$$Z^{(l)} = \sigma\left(\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} Z^{(l-1)} W^{(l)}\right) \qquad (15.1)$$

where $\tilde{A} = A + I$ corresponds to the adjacency matrix with self-loops, $\tilde{D}$ is a diagonal
degree matrix with $\tilde{D}_{i,i} = \tilde{A}_i^\top \mathbf{1}$ ($\mathbf{1}$ represents a column vector of ones) and $\tilde{D}_{i,j} = 0$
for $i \neq j$, $\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}}$ corresponds to a row and column normalization of $\tilde{A}$,
$Z^{(l)} \in \mathbb{R}^{n \times d^{(l)}}$ and $Z^{(l-1)} \in \mathbb{R}^{n \times d^{(l-1)}}$ represent the node embeddings in layer $l$ and
$(l-1)$ respectively with $Z^{(0)} = X$, $W^{(l)} \in \mathbb{R}^{d^{(l-1)} \times d^{(l)}}$ represents the weight matrix
at layer $l$, and $\sigma$ is an activation function.

The $l$th layer of a GCN model can be described in terms of the following steps.
First, it applies a linear projection to the node embeddings $Z^{(l-1)}$ using the weight
matrix $W^{(l)}$, then for each node $v_i$ it computes a weighted sum of the projected embeddings
of $v_i$ and its neighbors where the weights for the weighted sum are specified
according to $\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}}$, and finally it applies a non-linearity to the weighted
sums and updates the node embeddings. Notice that in an $L$-layer GCN, the embedding
for each node is computed based on its $L$-hop neighborhood (i.e. based on the
nodes that are at most $L$ hops away from it).
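The three steps above can be sketched in a few lines of NumPy. This is a minimal illustration of equation 15.1 with ReLU assumed as the activation; the function names and the toy path graph are not from the text.

```python
import numpy as np

def relu(v):
    return np.maximum(v, 0.0)

def gcn_layer(adj, z_prev, weight, activation=relu):
    """One GCN layer: sigma(D~^{-1/2} A~ D~^{-1/2} Z W) as in equation 15.1."""
    a_tilde = adj + np.eye(adj.shape[0])            # add self-loops
    d_inv_sqrt = a_tilde.sum(axis=1) ** -0.5        # degrees are >= 1 for non-negative A
    a_hat = a_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return activation(a_hat @ z_prev @ weight)      # project, aggregate, non-linearity

# Path graph v_0 - v_1 - v_2 with one-hot features and identity weights,
# so the output directly exposes the propagation pattern.
adj = np.array([[0.0, 1.0, 0.0],
                [1.0, 0.0, 1.0],
                [0.0, 1.0, 0.0]])
x = np.eye(3)
w = np.eye(3)
z1 = gcn_layer(adj, x, w)     # one layer: 1-hop information
z2 = gcn_layer(adj, z1, w)    # two layers: 2-hop information
```

After one layer the embedding of $v_0$ has no contribution from $v_2$ (`z1[0, 2]` is zero), whereas after two layers it does, which illustrates the $L$-hop neighborhood remark.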

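Graph Attention Networks: Instead of fixing the weights when computing a
weighted sum of the neighbors, attention-based GNNs replace $\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}}$ in equation 15.1
with an attention matrix $\hat{A}^{(l)} \in \mathbb{R}^{n \times n}$ such that:

$$Z^{(l)} = \sigma\left(\hat{A}^{(l)} Z^{(l-1)} W^{(l)}\right) \qquad (15.2)$$

$$\hat{A}^{(l)}_{i,j} = \frac{\hat{E}^{(l)}_{i,j}}{\sum_{k} \hat{E}^{(l)}_{i,k}}, \qquad \hat{E}^{(l)}_{i,j} = \tilde{A}_{i,j} \exp\left(\alpha\left(Z^{(l-1)}_i, Z^{(l-1)}_j; \theta^{(l)}\right)\right) \qquad (15.3)$$

where $\alpha : \mathbb{R}^{d^{(l-1)}} \times \mathbb{R}^{d^{(l-1)}} \to \mathbb{R}$ is a function with parameters $\theta^{(l)}$ that computes
attention weights for pairs of nodes. Here, $\tilde{A}$ acts as a mask that ensures $\hat{E}^{(l)}_{i,j} = 0$
(and consequently $\hat{A}^{(l)}_{i,j} = 0$) if $v_i$ and $v_j$ are not connected. The exp function in the
computation of $\hat{E}^{(l)}_{i,j}$ and the normalization $\hat{E}^{(l)}_{i,j} / \sum_{k} \hat{E}^{(l)}_{i,k}$ correspond to a (masked) softmax
function of the attention weights. Different attention-based GNNs can be constructed
with different choices of $\alpha$. In graph attention networks (GATs) (Veličković
et al, 2018), $\theta^{(l)} \in \mathbb{R}^{2d^{(l)}}$ and $\alpha$ is defined as follows:

$$\alpha\left(Z^{(l-1)}_i, Z^{(l-1)}_j; \theta^{(l)}\right) = \sigma\left(\theta^{(l)\top} \left(W^{(l)\top} Z^{(l-1)}_i \,\|\, W^{(l)\top} Z^{(l-1)}_j\right)\right) \qquad (15.4)$$

where $\sigma$ is an activation function. The formulation in equation 15.2 corresponds to
a single-head attention-based GNN. A multi-head attention-based GNN computes
multiple attention matrices $\hat{A}^{(l,1)}, \dots, \hat{A}^{(l,\beta)}$ using equation 15.3 but with different
weights $\theta^{(l,1)}, \dots, \theta^{(l,\beta)}$ and $W^{(l,1)}, \dots, W^{(l,\beta)}$ and then replaces equation 15.2
with:

$$Z^{(l)} = \sigma\left(\hat{A}^{(l,1)} Z^{(l-1)} W^{(l,1)} \,\|\, \dots \,\|\, \hat{A}^{(l,\beta)} Z^{(l-1)} W^{(l,\beta)}\right) \qquad (15.5)$$

where $\beta$ is the number of heads. Each head may learn to aggregate the neighbors
differently and extract different information.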
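A single attention head can be sketched as a masked softmax over pairwise scores. The sketch assumes a LeakyReLU for the activation inside the scoring function and a ReLU for the layer output, uses the adjacency matrix with self-loops as the mask, and relies on the fact that a dot product with a concatenated vector splits into a source term plus a target term; the names and sizes are illustrative.

```python
import numpy as np

def gat_layer(adj, z_prev, weight, theta, slope=0.2):
    """Single-head attention layer: masked softmax scores, then aggregation."""
    n, d_out = adj.shape[0], weight.shape[1]
    mask = adj + np.eye(n)                        # adjacency with self-loops as mask
    proj = z_prev @ weight                        # row i holds the projected embedding of v_i
    # theta^T (p_i || p_j) = theta[:d]^T p_i + theta[d:]^T p_j
    scores = (proj @ theta[:d_out])[:, None] + (proj @ theta[d_out:])[None, :]
    scores = np.where(scores > 0, scores, slope * scores)   # LeakyReLU (assumed activation)
    scores = np.where(mask > 0, scores, -np.inf)             # drop non-neighbors
    scores = scores - scores.max(axis=1, keepdims=True)      # shift cancels in the softmax
    e_hat = mask * np.exp(scores)                 # unnormalized masked attention
    att = e_hat / e_hat.sum(axis=1, keepdims=True)
    return np.maximum(att @ proj, 0.0), att       # ReLU output and attention matrix

rng = np.random.default_rng(0)
adj = np.array([[0.0, 1.0, 0.0, 0.0],
                [1.0, 0.0, 1.0, 0.0],
                [0.0, 1.0, 0.0, 1.0],
                [0.0, 0.0, 1.0, 0.0]])
z0 = rng.normal(size=(4, 3))
w = rng.normal(size=(3, 2))
theta = rng.normal(size=4)                        # length 2 * d_out
z1, att = gat_layer(adj, z0, w, theta)
```

A multi-head variant would call `gat_layer` once per head with separate `weight` and `theta` arrays and concatenate the pre-activation outputs along the feature axis before applying the non-linearity.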


15.2.2 Sequence Models 


Over the years, several models have been proposed that operate on sequences. In
this chapter, we are mainly interested in neural sequence models that take as input a
sequence $[\mathbf{x}^{(1)}, \mathbf{x}^{(2)}, \dots, \mathbf{x}^{(T)}]$ of observations where $\mathbf{x}^{(t)} \in \mathbb{R}^{d}$ for all $t \in \{1, \dots, T\}$,
and produce as output hidden representations $[\mathbf{h}^{(1)}, \mathbf{h}^{(2)}, \dots, \mathbf{h}^{(T)}]$ where $\mathbf{h}^{(t)} \in \mathbb{R}^{d'}$
for all $t \in \{1, \dots, T\}$. Here, $T$ represents the length of the sequence or the timestamp
for the last element in the sequence. Each hidden representation $\mathbf{h}^{(t)}$ is a sequence
embedding capturing information from the first $t$ observations. Providing a sequence
embedding for a given sequence can be viewed as dimensionality reduction where
the information from the first $t$ observations in the sequence is captured in a single
vector $\mathbf{h}^{(t)}$ which can be used to make informed predictions about the sequence. In
what follows, we describe recurrent neural networks, Transformers, and convolutional
neural networks for sequence modeling.

Recurrent Neural Networks: Recurrent neural networks (RNNs) (Elman, 1990)
and their variants have achieved impressive results on a range of sequence modeling
problems. The core principle of the RNN is that its output is a function of the current
data point as well as a representation of the previous inputs. Vanilla RNNs consume
the input sequence one element at a time and provide embeddings using the following
equation (applied sequentially for $t$ in $[1, \dots, T]$):

$$\mathbf{h}^{(t)} = \mathrm{RNN}\left(\mathbf{x}^{(t)}, \mathbf{h}^{(t-1)}\right) = \sigma\left(W^{(i)} \mathbf{x}^{(t)} + W^{(h)} \mathbf{h}^{(t-1)} + \mathbf{b}\right) \qquad (15.6)$$

where the $W^{(\cdot)}$s and $\mathbf{b}$ are the model parameters, $\mathbf{h}^{(t)}$ is the hidden state corresponding
to the embedding of the first $t$ observations, and $\mathbf{x}^{(t)}$ is the $t$th observation. One may
initialize $\mathbf{h}^{(0)} = \mathbf{0}$, where $\mathbf{0}$ is a vector of 0s, or let $\mathbf{h}^{(0)}$ be learned during training.
Training vanilla RNNs is typically difficult due to vanishing and exploding gradients.
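The recurrence in equation 15.6 can be sketched as a simple loop. Tanh is assumed as the activation, and the function name, parameter names, and sizes are illustrative.

```python
import numpy as np

def rnn_forward(xs, w_i, w_h, b, h0=None):
    """Vanilla RNN: h_t = tanh(W_i x_t + W_h h_{t-1} + b), applied for t = 1..T."""
    h = np.zeros(w_h.shape[0]) if h0 is None else h0   # h^(0) = 0 unless provided
    hidden = []
    for x in xs:                                       # one observation at a time
        h = np.tanh(w_i @ x + w_h @ h + b)
        hidden.append(h)
    return np.stack(hidden)                            # row t-1 is the embedding of x_1..x_t

rng = np.random.default_rng(0)
T, d, d_hidden = 6, 3, 4
xs = rng.normal(size=(T, d))
hs = rnn_forward(xs,
                 rng.normal(size=(d_hidden, d)),
                 rng.normal(size=(d_hidden, d_hidden)),
                 np.zeros(d_hidden))
```

Because each step depends on the previous hidden state, the loop cannot be parallelized over time, and gradients must flow through all $T$ applications of `w_h`, which is where the vanishing and exploding behavior originates.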

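Long short-term memory networks (LSTMs) (Hochreiter and Schmidhuber, 1997) (and
gated recurrent units (GRUs) (Cho et al, 2014a)) alleviate the training problem of
vanilla RNNs through gating mechanisms and additive operations. An LSTM model
consumes the input sequence one element at a time and provides embeddings using the
following equations:

$$\mathbf{i}^{(t)} = \sigma\left(W^{(ii)} \mathbf{x}^{(t)} + W^{(ih)} \mathbf{h}^{(t-1)} + \mathbf{b}^{(i)}\right) \qquad (15.7)$$
$$\mathbf{f}^{(t)} = \sigma\left(W^{(fi)} \mathbf{x}^{(t)} + W^{(fh)} \mathbf{h}^{(t-1)} + \mathbf{b}^{(f)}\right) \qquad (15.8)$$
$$\mathbf{c}^{(t)} = \mathbf{f}^{(t)} \odot \mathbf{c}^{(t-1)} + \mathbf{i}^{(t)} \odot \mathrm{Tanh}\left(W^{(ci)} \mathbf{x}^{(t)} + W^{(ch)} \mathbf{h}^{(t-1)} + \mathbf{b}^{(c)}\right) \qquad (15.9)$$
$$\mathbf{o}^{(t)} = \sigma\left(W^{(oi)} \mathbf{x}^{(t)} + W^{(oh)} \mathbf{h}^{(t-1)} + \mathbf{b}^{(o)}\right) \qquad (15.10)$$
$$\mathbf{h}^{(t)} = \mathbf{o}^{(t)} \odot \mathrm{Tanh}\left(\mathbf{c}^{(t)}\right) \qquad (15.11)$$

Fig. 15.1: An LSTM model taking as input a sequence $\mathbf{x}^{(1)}, \mathbf{x}^{(2)}, \dots, \mathbf{x}^{(T)}$ and
producing hidden representations $\mathbf{h}^{(1)}, \mathbf{h}^{(2)}, \dots, \mathbf{h}^{(T)}$ as output. Equations 15.7-15.11
describe the operations in LSTM Cells.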

Here i), f, and o”) represent the input, forget and output gates respectively, 
c") is the memory cell, h is the hidden state corresponding to the embedding of 
the sequence until t observation, oO is an activation function (typically Sigmoid), 
Tanh represents the hyperbolic tangent function, and W)s and bs are weight 
matrices and vectors. Similar to vanilla RNNs, one may initialize AO = ce =O0or 
let them be vectors with learnable parameters. Figure 15.1 shows an overview of an 
LSTM model. 
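
The LSTM cell in equations 15.7-15.11 can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions: the parameter names (`W_ii`, `W_ih`, and so on), the helper `init_params`, and the random initialization are chosen for this example only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_params(d_in, d_hid, rng):
    """Small random weights for the i, f, c, o transforms (illustrative only)."""
    p = {}
    for g in "ifco":
        p[f"W_{g}i"] = rng.normal(scale=0.1, size=(d_hid, d_in))
        p[f"W_{g}h"] = rng.normal(scale=0.1, size=(d_hid, d_hid))
        p[f"b_{g}"] = np.zeros(d_hid)
    return p

def lstm_step(x, h_prev, c_prev, p):
    i = sigmoid(p["W_ii"] @ x + p["W_ih"] @ h_prev + p["b_i"])  # input gate
    f = sigmoid(p["W_fi"] @ x + p["W_fh"] @ h_prev + p["b_f"])  # forget gate
    # additive memory update: keep part of the old cell, add gated new content
    c = f * c_prev + i * np.tanh(p["W_ci"] @ x + p["W_ch"] @ h_prev + p["b_c"])
    o = sigmoid(p["W_oi"] @ x + p["W_oh"] @ h_prev + p["b_o"])  # output gate
    h = o * np.tanh(c)
    return h, c

rng = np.random.default_rng(0)
d_in, d_hid, T = 3, 4, 6
p = init_params(d_in, d_hid, rng)
h, c = np.zeros(d_hid), np.zeros(d_hid)  # zero initial states
hs = []
for x in rng.normal(size=(T, d_in)):
    h, c = lstm_step(x, h, c, p)
    hs.append(h)
hs = np.stack(hs)
print(hs.shape)  # (6, 4)
```

The additive form of the cell update is what lets gradients flow through many steps more easily than in the vanilla recurrence.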

A bidirectional RNN (BiRNN) (Schuster and Paliwal, 1997) is a combination of
two RNNs, one consuming the input sequence $[x^{(1)}, x^{(2)}, \ldots, x^{(\tau)}]$ in the forward
direction and producing hidden representations
$[\overrightarrow{h}^{(1)}, \overrightarrow{h}^{(2)}, \ldots, \overrightarrow{h}^{(\tau)}]$ as output,
and the other consuming the input sequence backwards (i.e. $[x^{(\tau)}, x^{(\tau-1)}, \ldots, x^{(1)}]$)
and producing hidden representations
$[\overleftarrow{h}^{(\tau)}, \overleftarrow{h}^{(\tau-1)}, \ldots, \overleftarrow{h}^{(1)}]$ as output. These two
hidden representations are then concatenated, producing a single hidden representation
$h^{(t)} = (\overrightarrow{h}^{(t)} \,\|\, \overleftarrow{h}^{(t)})$. Note that in RNNs, $h^{(t)}$ is computed
based only on observations at or before $t$, whereas in BiRNNs, $h^{(t)}$ is computed based
on observations at, before, or after $t$. BiLSTMs (Graves et al, 2005) are a specific
version of BiRNNs where the RNN is an LSTM.

Transformers: Consuming the input sequence one element at a time makes RNNs hard
to parallelize. It also makes capturing long-range dependencies difficult. To solve
these issues, the Transformer model (Vaswani et al, 2017) processes a sequence as a
whole. The central operation in Transformer models is the self-attention mechanism.
Let $H^{(l-1)}$ be an embedding matrix in layer $(l-1)$ such that its $t$-th row $H^{(l-1)}_t$
represents the embedding of the first $t$ observations. The self-attention mechanism at
each layer $l$ can be described similarly to equation 15.2 and equation 15.3 for
attention-based GNNs by defining $A$ in equation 15.3 as a lower triangular matrix
where $A_{i,j} = 1$ if $j \le i$ and $A_{i,j} = 0$ otherwise, replacing $Z^{(l)}$ and $Z^{(l-1)}$
with $H^{(l)}$ and $H^{(l-1)}$, and defining the $\alpha$ function in equation 15.3 as follows:


$$\alpha(H^{(l-1)}_t, H^{(l-1)}_{t'}; \theta^{(l)}) = \frac{Q_t^{\top} K_{t'}}{\sqrt{d^{(k)}}}, \quad Q = H^{(l-1)} W^{(l,Q)}, \quad K = H^{(l-1)} W^{(l,K)} \tag{15.12}$$


where $\theta^{(l)} = \{W^{(l,Q)}, W^{(l,K)}\}$ are the weights with
$W^{(l,Q)}, W^{(l,K)} \in \mathbb{R}^{d^{(l-1)} \times d^{(k)}}$, and $d^{(l-1)}$ is the number of columns of $H^{(l-1)}$.
The matrices $Q$ and $K$ are called the query and key matrices¹. $Q_t$ and $K_{t'}$ represent
column vectors corresponding to the $t$-th and $t'$-th rows of $Q$ and $K$, respectively. After
$L$ layers, the hidden representations $H^{(L)}$ contain the sequence embeddings with
$H^{(L)}_t$ corresponding to the embedding of the first $t$ observations (denoted as $h^{(t)}$ for
RNNs). The lower-triangular matrix $A$ ensures that the embedding $H^{(L)}_t$ is computed
based only on the observations at and before the $t$-th observation. One may define $A$ as
a matrix of all 1s to allow $H^{(L)}_t$ to be computed based on the observations at, before,
and after the $t$-th observation (similar to BiRNNs).
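
The role of the matrix $A$ can be illustrated with a small NumPy sketch that computes masked attention weights from query-key scores. It assumes that equation 15.3 normalizes the exponentiated scores over the entries allowed by $A$ (a masked softmax); the function name `attention_weights` and all dimensions are hypothetical.

```python
import numpy as np

def attention_weights(H, W_q, W_k, A):
    """Row t holds the attention of position t over the positions allowed by A."""
    Q, K = H @ W_q, H @ W_k
    scores = Q @ K.T / np.sqrt(W_k.shape[1])       # scaled dot-product scores
    scores = np.where(A > 0, scores, -np.inf)      # disallowed pairs get zero weight
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
T, d, d_k = 5, 8, 4
H = rng.normal(size=(T, d))
W_q = rng.normal(size=(d, d_k))
W_k = rng.normal(size=(d, d_k))

A_causal = np.tril(np.ones((T, T)))  # position t sees positions t' <= t
A_full = np.ones((T, T))             # every position sees every other (BiRNN-like)

w_causal = attention_weights(H, W_q, W_k, A_causal)
w_full = attention_weights(H, W_q, W_k, A_full)
print(np.round(w_causal, 2))
```

With the lower-triangular mask, the first row places all of its weight on the first position, and every entry above the diagonal is zero.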

In equation 15.12, the embeddings are updated based on an aggregation of the
embeddings from the previous timestamps, but the order of these embeddings is not
modeled explicitly. To take the order into account, the embeddings in the
Transformer model are initialized as $H^{(0)}_t = x^{(t)} + p^{(t)}$ or
$H^{(0)}_t = (x^{(t)} \,\|\, p^{(t)})$, where $H^{(0)}_t$ is the $t$-th row of $H^{(0)}$, $x^{(t)}$ is
the $t$-th observation, and $p^{(t)}$ is a positional encoding capturing information about
the position of the observation in the sequence. In the original work, the positional
encodings are defined as follows:


$$p^{(t)}_{2i} = \sin\left(t / 10000^{2i/d}\right), \quad p^{(t)}_{2i+1} = \sin\left(t / 10000^{2i/d} + \pi/2\right) \tag{15.13}$$


where $d$ is the dimension of $p^{(t)}$. Note that $p^{(t)}$ is constant and does not change
during training.
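
The positional encodings of equation 15.13 can be computed as in the sketch below. The function name `positional_encoding`, the 1-based positions, and the assumption that the encoding dimension is even are choices made for this illustration.

```python
import numpy as np

def positional_encoding(T, d):
    """Return a (T, d) matrix whose row t-1 is the encoding of position t (d even)."""
    pos = np.arange(1, T + 1)[:, None]      # positions t = 1..T
    i = np.arange(d // 2)[None, :]          # index i of each (2i, 2i+1) pair
    angle = pos / (10000 ** (2 * i / d))
    P = np.zeros((T, d))
    P[:, 0::2] = np.sin(angle)              # even components
    P[:, 1::2] = np.sin(angle + np.pi / 2)  # odd components, equal to cos(angle)
    return P

P = positional_encoding(T=6, d=8)
print(P.shape)  # (6, 8)
```

Since the matrix depends only on the position and the dimension, it can be computed once and reused; no parameter in it is trained.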

Convolutional Neural Networks: Convolutional neural networks (CNNs) (Le Cun
et al, 1989) have revolutionized many computer vision applications. Originally,
CNNs were proposed for 2D signals such as images. They were later used for 1D
signals such as sequences and time-series. Here, we describe 1D CNNs. We start
by describing 1D convolutions. Let $H \in \mathbb{R}^{n \times d}$ be a matrix and $F \in \mathbb{R}^{u \times d}$ be a
convolution filter. Applying the filter $F$ on $H$ produces a vector $h' \in \mathbb{R}^{n-u+1}$ as
follows:


$$h'_i = \sum_{j=1}^{u} \sum_{k=1}^{d} H_{i+j-1,k} F_{j,k} \tag{15.14}$$


¹ For readers familiar with Transformers, in our description the values matrix corresponds to the
multiplication of the embedding matrix with the weight matrix $W$ in equation 15.2.


Input:
   0.1 | -0.2 | 1.1 |  0.2
   0.9 | -0.8 | 1.0 |  1.0
   0.2 |  0.3 | 0.4 |  0.5
   0.6 | -0.6 | 0.5 | -0.5
   1.1 |  1.2 | 2.1 |  2.2
   0.0 |  0.0 | 1.0 |  1.2

Filter 1:
   0.4 |  0.0 | 1.0 | 0.4
   0.0 | -1.2 | 3.2 | 0.5

Filter 2:
  -1.2 | 0.8 |  0.0 | 0.0
   0.0 | 0.0 | -3.2 | 0.5

Result (first column: Filter 1, second column: Filter 2):
   5.88 | -2.98
   2.93 | -2.75
   2.75 | -1.85
   6.92 | -6.82
   7.22 | -2.96

Worked entry (second row of the result, Filter 2):
(-1.2)(0.9) + (0.8)(-0.8) + (0.0)(1.0) + (0.0)(1.0)
+ (0.0)(0.2) + (0.0)(0.3) + (-3.2)(0.4) + (0.5)(0.5) = -2.75


Fig. 15.2: An example of a 1D convolution operation with two convolution filters.


It is also possible to produce a vector $h' \in \mathbb{R}^{n}$ (i.e. a vector whose dimension is the
same as the first dimension of $H$) by padding $H$ with zeros. Having $d'$ convolution
filters, one can generate $d'$ vectors as in equation 15.14 and stack them to generate a
matrix $H' \in \mathbb{R}^{(n-u+1) \times d'}$ (or $H' \in \mathbb{R}^{n \times d'}$). Figure 15.2 provides an example of 1D
convolution.
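
The numbers in Figure 15.2 can be reproduced with a direct implementation of equation 15.14. The sketch below uses the input and filter values that appear in the figure; the function name `conv1d` is an assumption of this example.

```python
import numpy as np

def conv1d(H, F):
    """Slide a (u, d) filter over the rows of an (n, d) matrix, without padding."""
    n, u = H.shape[0], F.shape[0]
    # each output entry sums the element-wise product of F with a window of u rows
    return np.array([np.sum(H[i:i + u] * F) for i in range(n - u + 1)])

H = np.array([[0.1, -0.2, 1.1,  0.2],
              [0.9, -0.8, 1.0,  1.0],
              [0.2,  0.3, 0.4,  0.5],
              [0.6, -0.6, 0.5, -0.5],
              [1.1,  1.2, 2.1,  2.2],
              [0.0,  0.0, 1.0,  1.2]])
F1 = np.array([[0.4,  0.0, 1.0, 0.4],
               [0.0, -1.2, 3.2, 0.5]])
F2 = np.array([[-1.2, 0.8,  0.0, 0.0],
               [ 0.0, 0.0, -3.2, 0.5]])

# stack one output column per filter: shape (n - u + 1, number of filters)
H_out = np.stack([conv1d(H, F1), conv1d(H, F2)], axis=1)
print(np.round(H_out, 2))
# [[ 5.88 -2.98]
#  [ 2.93 -2.75]
#  [ 2.75 -1.85]
#  [ 6.92 -6.82]
#  [ 7.22 -2.96]]
```

With six input rows and a filter spanning two rows, the unpadded output has five rows, matching the $n-u+1$ dimension stated above.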

The 1D convolution operation in equation 15.14 is the main building block of
1D CNNs. Similar to equation 15.12, let us assume $H^{(l-1)}$ represents the
embeddings in layer $(l-1)$ with $H^{(0)}_t = x^{(t)}$, where $H^{(0)}_t$ represents the $t$-th row of $H^{(0)}$
and $x^{(t)}$ is the $t$-th observation. 1D CNN models apply multiple convolution filters to
$H^{(l-1)}$ as described above and produce a matrix to which activation and (sometimes)
pooling operations are applied to produce $H^{(l)}$. The convolution filters are
the learnable parameters of the model. Hereafter, we use the term CNN to refer to
the general family of 1D convolutional neural networks.


15.2.3 Encoder-Decoder Framework and Model Training 


A deep neural network model can typically be decomposed into an encoder and a de- 
coder module. The encoder module takes the input and provides vector-representations 
(or embeddings), and the decoder module takes the embeddings and provides pre- 
dictions. The GNNs and sequence models described in Sections 15.2.1 and 15.2.2 
correspond to the encoder modules of a full model; they provide node embeddings 
Z and sequence embeddings H, respectively. The decoder is typically task-specific. 
As an example, for a node classification task, the decoder can be a feed-forward neural
network applied on a node embedding $Z_i$ provided by the encoder, followed by a
softmax function. Such a decoder provides as output a vector $\hat{y} \in \mathbb{R}^{|C|}$ where $C$
represents the classes, $|C|$ represents the number of classes, and $\hat{y}_j$ is the probability
of the node belonging to the $j$-th class. A similar decoder can be used for sequence
classification. As another example, for a link prediction problem, the decoder can 
take as input the embeddings for two nodes, take the sigmoid of a dot-product of the 
two node embeddings, and use the produced number as the probability of an edge 
existing between the two nodes. 
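
The two decoders described above can be sketched as follows. This is a minimal illustration: the single linear layer in `node_classifier` stands in for a general feed-forward network, and all names and dimensions are assumptions of this example.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # subtract the max for numerical stability
    return e / e.sum()

def node_classifier(z_i, W, b):
    """One-layer feed-forward decoder followed by softmax (depth is an assumption)."""
    return softmax(W @ z_i + b)

def link_probability(z_i, z_j):
    """Sigmoid of the dot product of two node embeddings."""
    return 1.0 / (1.0 + np.exp(-(z_i @ z_j)))

rng = np.random.default_rng(0)
d, n_classes = 4, 3
z_i, z_j = rng.normal(size=d), rng.normal(size=d)
W, b = rng.normal(size=(n_classes, d)), np.zeros(n_classes)

y_hat = node_classifier(z_i, W, b)   # class probabilities for node i
p_edge = link_probability(z_i, z_j)  # probability of an edge between i and j
print(y_hat.sum(), p_edge)
```

Note that the dot-product link decoder is symmetric in its two arguments, so on its own it cannot distinguish the direction of an edge.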

The parameters of a model are learned through optimization by minimizing a
task-specific loss function. For a classification task, for instance, we typically
assume having access to a set of ground-truth labels $Y$ where $Y_{i,j} = 1$ if the $i$-th example
belongs to the $j$-th class and $Y_{i,j} = 0$ otherwise. We learn the parameters of the model
by minimizing (e.g., using stochastic gradient descent) the cross entropy loss
defined as follows:


$$L = -\frac{1}{|Y|} \sum_{i} \sum_{j} Y_{i,j} \log(\hat{Y}_{i,j}) \tag{15.15}$$


where $|Y|$ denotes the number of rows in $Y$, corresponding to the number of
labeled examples, and $\hat{Y}_{i,j}$ is the probability of the $i$-th example belonging to the $j$-th
class according to the model. For other tasks, one may use other appropriate loss
functions.
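
As a small worked instance of equation 15.15, the sketch below evaluates the cross entropy loss on two labeled examples. The label and probability values are made up for illustration, and the small constant `eps` is added only to avoid taking the logarithm of zero.

```python
import numpy as np

def cross_entropy(Y, Y_hat, eps=1e-12):
    """Mean over labeled examples of -sum_j Y[i, j] * log(Y_hat[i, j])."""
    return -np.sum(Y * np.log(Y_hat + eps)) / Y.shape[0]

Y = np.array([[1, 0, 0],
              [0, 1, 0]])            # one-hot ground-truth labels
Y_hat = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1]])  # predicted class probabilities

loss = cross_entropy(Y, Y_hat)
print(loss)  # equals -(log 0.7 + log 0.8) / 2 up to eps
```

Only the probability assigned to the correct class of each example contributes to the sum, because all other entries of $Y$ are zero.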


15.3 Categories of Dynamic Graphs 


Different applications give rise to different types of dynamic graphs and different 
prediction problems. Before commencing the model development, it is crucial to 
identify the type of dynamic graph and its static and evolving parts, and have a clear 
understanding of the prediction problem. In what follows, we describe some general 
categories of dynamic graphs, their evolution types, and some common prediction 
problems for them. 


15.3.1 Discrete vs. Continuous


As pointed out in (Kazemi et al, 2020), dynamic graphs can be divided into discrete- 
time and continuous-time categories. Here, we describe the two categories and point 
out how discrete-time can be considered a specific case of continuous-time dynamic 
graphs. 

A discrete-time dynamic graph (DTDG) is a sequence $[G^{(1)}, G^{(2)}, \ldots, G^{(\tau)}]$ of
graph snapshots where each $G^{(t)} = (V^{(t)}, A^{(t)}, X^{(t)})$ has vertices $V^{(t)}$, adjacency
matrix $A^{(t)}$ and feature matrix $X^{(t)}$. DTDGs mainly appear in applications where
(sensory) data is captured at regularly-spaced intervals.


Example 15.1. Figure 15.3 shows three snapshots of an example DTDG. In the first
snapshot, there are three nodes. In the next snapshot, a new node $v_4$ is added and a
connection is formed between this node and $v_2$. Furthermore, the features of $v_1$ are
updated. In the third snapshot, a new edge has been added between $v_3$ and $v_4$.


Spatio-temporal graphs are a special type of DTDG in which a set of entities are
spatially (i.e. in terms of closeness in space) and temporally correlated and data is
captured at regularly-spaced intervals. An example of such a spatio-temporal graph
is traffic data in a city or a region where traffic statistics at each road are computed at
regularly-spaced intervals; the traffic at a particular road at time $t$ is correlated with
the traffic at the roads connected to it at time $t$ (spatial correlation) as well as the
traffic at this road and the roads connected to it at previous timestamps (temporal
correlation). In this example, the nodes in each $G^{(t)}$ may represent roads (or road
segments), the adjacency matrix $A^{(t)}$ may represent how the roads are connected,
and the feature matrix $X^{(t)}$ may represent the traffic statistics in each road at time $t$.


Fig. 15.3: Three snapshots of an example DTDG. In the first snapshot, there are 3
nodes. In the second snapshot, a new node $v_4$ is added and a connection is formed
between this node and $v_2$. Moreover, the features of $v_1$ are updated. In the third
snapshot, a new edge has been added between $v_3$ and $v_4$.

A continuous-time dynamic graph (CTDG) is a pair $(G^{(t_0)}, O)$ where
$G^{(t_0)} = (V^{(t_0)}, A^{(t_0)}, X^{(t_0)})$ is a static graph² representing an initial state at time $t_0$ and $O$ is
a sequence of temporal observations/events. Each observation is a tuple of the form
(event type, event, timestamp) where event type can be a node or edge addition,
node or edge deletion, node feature update, etc., event represents the actual event
that happened, and timestamp is the time at which the event occurred.


Example 15.2. An example of a CTDG is a pair $(G^{(t_0)}, O)$ where $G^{(t_0)}$ is the graph
in the first snapshot of Figure 15.3 and the observations are as follows:


O = [(add node, $v_4$, 20-05-2020), (add edge, $(v_2, v_4)$, 21-05-2020),
     (feature update, $(v_1, [0.1, 2])$, 28-05-2020), (add edge, $(v_3, v_4)$, 04-06-2020)]


where, e.g., (add node, $v_4$, 20-05-2020) is an observation corresponding to a new
node $v_4$ being added to the graph at time 20-05-2020.


At any point in time $t \ge t_0$, a snapshot $G^{(t)}$ (corresponding to a static graph) can
be obtained from a CTDG by updating $G^{(t_0)}$ sequentially according to the
observations in $O$ that occurred before (or at) time $t$. In some cases, multiple edges may
have been added between two nodes, giving rise to multi-graphs; one may aggregate
the edges to convert the multi-graph into a simple graph if required. Therefore,
a DTDG can be viewed as a special case of a CTDG where only some regularly
spaced snapshots of the CTDG are available.


² Note that we can have $V^{(t_0)} = \{\}$, corresponding to a graph with no nodes. We can also have
$A^{(t_0)}_{i,j} = 0$ for all $i, j$, corresponding to a graph with no edges.


Example 15.3. For the CTDG in Example 15.2, assume $t_0$ = 01-05-2020 and that we
only observe the state of the graph on the first day of each month (01-05-2020,
01-06-2020 and 01-07-2020 for this example). In this case, the CTDG will reduce to
the DTDG snapshots in Figure 15.3.
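
The reduction from a CTDG to DTDG snapshots in Example 15.3 can be sketched by replaying the observations of Example 15.2 up to each observation date. The function `snapshot`, the string event names, and the empty initial edge and feature sets are assumptions of this example (the text does not list the edges or features of the first snapshot); deletions are omitted for brevity.

```python
from datetime import date

def snapshot(nodes, edges, features, events, t):
    """Replay all events with timestamp <= t on a copy of the initial graph."""
    nodes, edges, features = set(nodes), set(edges), dict(features)
    for kind, payload, ts in sorted(events, key=lambda e: e[2]):
        if ts > t:
            break
        if kind == "add node":
            nodes.add(payload)
        elif kind == "add edge":
            edges.add(frozenset(payload))  # undirected edge
        elif kind == "feature update":
            node, value = payload
            features[node] = value
    return nodes, edges, features

events = [
    ("add node", "v4", date(2020, 5, 20)),
    ("add edge", ("v2", "v4"), date(2020, 5, 21)),
    ("feature update", ("v1", [0.1, 2]), date(2020, 5, 28)),
    ("add edge", ("v3", "v4"), date(2020, 6, 4)),
]
# Initial state: three nodes; edges and features left empty (assumption)
initial = ({"v1", "v2", "v3"}, set(), {})

for day in [date(2020, 5, 1), date(2020, 6, 1), date(2020, 7, 1)]:
    n, e, f = snapshot(*initial, events, day)
    print(day, len(n), len(e))
# 2020-05-01 3 0
# 2020-06-01 4 1
# 2020-07-01 4 2
```

The three printed states correspond to the three snapshots of the example: the initial graph, the graph after $v_4$ and its edge to $v_2$ are added, and the graph after the edge between $v_3$ and $v_4$ is added.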


15.3.2 Types of Evolution 


For both DTDGs and CTDGs, various parts of the graph may change and evolve. 
Here, we describe some of the main types of evolution. As a running example, we 
use a dynamic graph corresponding to a social network where the nodes represent 
users and the edges represent connections such as friendship. 

Node addition/deletion: In our running example, new users may join the plat- 
form resulting in new nodes being added to the graph, and some users may leave the 
platform resulting in some nodes being removed from the graph. 

Feature update: Users may have multiple features such as age, country of resi- 
dence, occupation, etc. These features may change over time as users become older, 
move to a new country, or change their occupation. 

Edge addition/deletion: As time goes by, some users become friends resulting 
in new edges and some people stop being friends resulting in some edges being 
removed from the graph. As pointed out in (Trivedi et al, 2019), the observations 
corresponding to events between two nodes may be categorized into association 
and communication events. The former corresponds to events that lead to structural 
changes in the graph and result in a long-lasting flow of information between the 
nodes (e.g., the formation of new friendships in social networks). The latter cor- 
responds to events that result in a temporary flow of information between nodes 
(e.g., the exchange of messages in a social network). These two event categories 
typically evolve at different rates and one may model them differently, especially in 
applications where they are both present. 

Edge weight updates: The adjacency matrix corresponding to the friendships 
may be weighted where the weights represent the strength of the friendships (e.g., 
computed based on the duration of friendship or other features). In this case, the 
strength of the friendships may change over time resulting in edge weight updates. 

Relation updates: The edges between the users may be labeled where the label 
indicates the type of the connection, e.g., friendship, engagement, and siblings. In 
this case, the relation between two users may change over time (e.g., it may change 
from friendship to engagement). One may see relation update as a special case of 
edge evolution where one edge is deleted and another edge is added (e.g., the friendship
edge is removed and an engagement edge is added).


15.3.3 Prediction Problems, Interpolation, and Extrapolation


We review four types of prediction problems for dynamic graphs: node
classification/regression, graph classification, link prediction, and time prediction. Some of
these problems can be studied under two settings: interpolation and extrapolation.
They can also be studied under a transductive or inductive prediction setting. In
what follows, we describe each prediction problem. We let $\mathcal{G}$ be a (discrete-time
or continuous-time) dynamic graph containing information in a time interval $[t_0, T]$.

Node classification/regression: Let $V^{(t)} = \{v_1, \ldots, v_n\}$ represent the nodes in $\mathcal{G}$ at
time $t$. Node classification at time $t$ is the problem of classifying a node $v_i \in V^{(t)}$ into
a predefined set of classes $C$. Node regression at time $t$ is the problem of predicting
a continuous feature for a node $v_i \in V^{(t)}$. In the extrapolation setting, we make
predictions about a future state (i.e. $t > T$) and the predictions are made based on
the observations before or at $T$ (e.g., forecasting the weather for the upcoming days).
In the interpolation setting, $t_0 \le t \le T$ and the predictions are made based on all the
observations (e.g., filling in missing values).

Graph classification: Let $\{\mathcal{G}_1, \mathcal{G}_2, \ldots, \mathcal{G}_k\}$ be a set of dynamic graphs. Graph
classification is the problem of classifying each dynamic graph $\mathcal{G}_i$ into a predefined set of
classes $C$.

Link prediction: Link prediction is the problem of predicting new links between
the nodes of a dynamic graph. In the case of interpolation, the goal is to predict if
there was an edge between two nodes $v_i$ and $v_j$ at a timestamp $t_0 \le t \le T$ (or a time
interval between $t_0$ and $T$), assuming that $v_i$ and $v_j$ are in $\mathcal{G}$ at time $t$. The interpolation
problem is also known as the completion problem and can be used to predict missing
links. In the case of extrapolation, the goal is to predict if there is going to be an
edge between two nodes $v_i$ and $v_j$ at a timestamp $t > T$ (or a time interval after $T$),
assuming that $v_i$ and $v_j$ are in $\mathcal{G}$ at time $T$.

Time prediction: Time prediction is the problem of predicting when an event
happened or when it will happen. In the case of interpolation (sometimes called
temporal scoping), the goal is to predict the time $t_0 \le t \le T$ when an event occurred
(e.g., when two nodes $v_i$ and $v_j$ started or ended their connection). In the
extrapolation case (sometimes called time to event prediction), the goal is to predict the time
$t > T$ when an event will happen (e.g., when a connection will be formed between
$v_i$ and $v_j$).

Transductive vs. Inductive: The above problem definitions for node classifi- 
cation/regression, link prediction, and time prediction correspond to a transductive 
setting in which at the test time, predictions are to be made for entities already ob- 
served during training. In the inductive setting, information about previously unseen 
entities (or entirely new graphs) is provided at the test time and predictions are to 
be made for these entities (see (Hamilton et al, 2017b; Xu et al, 2020a; Albooyeh 
et al, 2020) for examples). The graph classification task is inductive by nature as it 
requires making predictions for previously unseen graphs at the test time.


15.4 Modeling Dynamic Graphs with Graph Neural Networks


In Section 15.2.1, we described how applying a GNN on a static graph $G$ provides an
embedding matrix $Z \in \mathbb{R}^{n \times d'}$ where $n$ is the number of nodes, $d'$ is the embedding
dimension, and $Z_i$ represents the embedding for the $i$-th entity $v_i$ and can be used to
make predictions about it. For dynamic graphs, we wish to extend GNNs to obtain
embeddings $Z^{(t)} \in \mathbb{R}^{n_t \times d'}$ for any timestamp $t$, where $n_t$ is the number of nodes in
the graph at time $t$ and $Z^{(t)}_i$ captures the information about the $i$-th entity at time $t$. In
this section, we review several such extensions of GNNs. We mainly describe the
encoder part of the models for dynamic graphs as the decoder and the loss functions
can be defined similarly to Section 15.2.3.


15.4.1 Conversion to Static Graphs 


A simple but sometimes effective approach for applying GNNs on dynamic graphs 
is to first convert the dynamic graph into a static graph and then apply a GNN on the 
resulting static graph. The main benefits of this approach include simplicity as well 
as enabling the use of a wealth of GNN models and techniques for static graphs. 
One disadvantage with this approach, however, is the potential loss of information. 
In what follows, we describe two conversion approaches. 

Temporal aggregation: We start by describing temporal aggregation for a particular
type of dynamic graph and then explain how it extends to more general
cases. Consider a DTDG $[G^{(1)}, G^{(2)}, \ldots, G^{(\tau)}]$ where each $G^{(t)} = (V^{(t)}, A^{(t)}, X^{(t)})$
such that $V^{(1)} = \cdots = V^{(\tau)} = V$ and $X^{(1)} = \cdots = X^{(\tau)} = X$ (i.e. the nodes and their
features are fixed over time and only the adjacency matrix evolves). Note that in this
case, the adjacency matrices have the same shape. One way to convert this DTDG
into a static graph is through a weighted aggregation of the adjacency matrices as
follows:


$$A^{(agg)} = \sum_{t=1}^{\tau} \phi(t, \tau) A^{(t)} \tag{15.16}$$


where $\phi : \mathbb{R} \times \mathbb{R} \to \mathbb{R}$ provides the weight for the $t$-th adjacency matrix as a function of
$t$ and $\tau$. For extrapolation problems, a common choice for $\phi$ is
$\phi(t, \tau) = \exp(-\theta(\tau - t))$, corresponding to exponentially decaying the importance of the older adjacency
matrices (Yao et al, 2016). Here, $\theta$ is a hyperparameter controlling how fast the
importance decays. For interpolation problems where a prediction is to be made for
a timestamp $1 \le t' \le \tau$, one may define the function as $\phi(t, t') = \exp(-\theta|t' - t|)$,
corresponding to exponentially decaying the importance of the adjacency matrices
as they move further away from $t'$. Through this aggregation, one can convert the
DTDG above into a static graph $G = (V, A^{(agg)}, X)$ and subsequently apply a static
GNN model on it to make predictions. It is important to note that the aggregated
adjacency matrix is weighted (i.e. $A^{(agg)} \in \mathbb{R}^{n \times n}$), so one can only use GNN
models that can handle weighted graphs.


[Figure 15.4 panels: First Snapshot, Second Snapshot, Third Snapshot]

Fig. 15.4: An example of converting a DTDG into a static graph through temporal
unrolling. Solid lines represent the edges in the graph at different timestamps and
dashed lines represent the added edges. In this example, each node is connected
to the node corresponding to the same entity only in the previous timestamp (i.e.
$\omega = 1$).


In the case where node features also evolve, one may use a similar aggregation as
in equation 15.16 and compute $X^{(agg)}$ based on $[X^{(1)}, X^{(2)}, \ldots, X^{(\tau)}]$. In the case
where nodes are added and removed, one possible way of aggregation is as follows.
Let $V^{(s)} = \{v \mid v \in V^{(1)} \cup \cdots \cup V^{(\tau)}\}$ represent the set of all the nodes that existed
throughout time. We can expand every $A^{(t)}$ to a matrix in $\mathbb{R}^{|V^{(s)}| \times |V^{(s)}|}$ where the
values for the rows and columns corresponding to any node $v \notin V^{(t)}$ are all 0s. The
feature matrices can be expanded similarly. Then, equation 15.16 can be applied on
the expanded adjacency and feature matrices. A similar aggregation can be done for
CTDGs by first converting them into DTDGs (see Section 15.3.1) and then applying
equation 15.16.


Example 15.4. Consider a DTDG with the three snapshots in Figure 15.3. We let
$V^{(s)} = \{v_1, v_2, v_3, v_4\}$, add a row and a column of zeros to $A^{(1)}$, and add a row of
zeros to $X^{(1)}$. Then, we use equation 15.16 with some value of $\theta$ to compute $A^{(agg)}$
and $X^{(agg)}$. Finally, we apply a GNN on the aggregated graph.
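
Temporal aggregation with exponential decay can be sketched as below for a three-snapshot DTDG in the spirit of Example 15.4. The edges among $v_1$, $v_2$, $v_3$ are assumed for illustration (the text does not list them), and the names `aggregate` and `adjacency` are hypothetical.

```python
import numpy as np

def aggregate(adjs, theta, t_pred=None):
    """Decay-weighted sum of adjacency matrices given for t = 1..tau.

    With t_pred=None the weights decay away from the last snapshot
    (extrapolation); otherwise they decay away from t_pred (interpolation).
    """
    tau = len(adjs)
    ref = tau if t_pred is None else t_pred
    weights = [np.exp(-theta * abs(ref - t)) for t in range(1, tau + 1)]
    return sum(w * A for w, A in zip(weights, adjs))

def adjacency(edges, n=4):
    A = np.zeros((n, n))
    for i, j in edges:
        A[i, j] = A[j, i] = 1.0
    return A

base = [(0, 1), (1, 2)]                   # assumed edges among v1, v2, v3
A1 = adjacency(base)                      # v4 absent: its row and column are zero
A2 = adjacency(base + [(1, 3)])           # edge v2-v4 added
A3 = adjacency(base + [(1, 3), (2, 3)])   # edge v3-v4 added

A_agg = aggregate([A1, A2, A3], theta=0.5)
print(np.round(A_agg, 3))
```

Edges that have existed for longer accumulate more weight: the assumed $v_1$-$v_2$ edge receives $1 + e^{-0.5} + e^{-1}$, while the most recent edge between $v_3$ and $v_4$ receives weight 1.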


Temporal unrolling: Another way of converting a dynamic graph into a static
graph is unrolling the dynamic graph and connecting the nodes corresponding to
the same object across time. Consider a DTDG $[G^{(1)}, G^{(2)}, \ldots, G^{(\tau)}]$ and let
$G^{(t)} = (V^{(t)}, A^{(t)}, X^{(t)})$ for $t \in \{1, \ldots, \tau\}$. Let $G^{(s)} = (V^{(s)}, A^{(s)}, X^{(s)})$ represent the static
graph to be generated from the DTDG. We let $V^{(s)} = \{v^{(t)} \mid v \in V^{(t)}, t \in \{1, \ldots, \tau\}\}$.
That is, every node $v \in V^{(t)}$ at every timestamp $t \in \{1, \ldots, \tau\}$ becomes a new node
named $v^{(t)}$ in $V^{(s)}$ (so $|V^{(s)}| = \sum_{t=1}^{\tau} |V^{(t)}|$). Note that this is different from the way
we constructed $V^{(s)}$ for temporal aggregation: here every node at every timestamp
becomes a node in $V^{(s)}$, whereas in temporal aggregation we took a union of the
nodes across timestamps. For every node $v^{(t)} \in V^{(s)}$, we let the features of $v^{(t)}$ in
$X^{(s)}$ be the same as its features in $X^{(t)}$. If two nodes $v_i, v_j \in V^{(t)}$ are connected
according to $A^{(t)}$, we connect the corresponding nodes $v_i^{(t)}$ and $v_j^{(t)}$ in $A^{(s)}$. We also connect
each node $v^{(t)}$ to $v^{(t')}$ for $t' \in \{\max(1, t - \omega), \ldots, t - 1\}$, so a node corresponding
to an entity at time $t$ becomes connected to the nodes corresponding to the same
entity at the previous $\omega$ timestamps, where $\omega$ is a hyperparameter. One may assign
different weights to these temporal edges in $A^{(s)}$ based on the difference between $t$
and $t'$ (e.g., exponentially decaying the weight). Having constructed the static graph
$G^{(s)}$, one may apply a GNN model on it and, e.g., use the resulting embeddings
for the nodes $v^{(\tau)}$ (i.e. the nodes corresponding to the $\tau$-th timestamp of the DTDG) to make
predictions about the nodes.


Example 15.5. Figure 15.4 provides an example of temporal unrolling for the DTDG
in Figure 15.3 with $\omega = 1$. The graph has 11 nodes overall and so $A^{(s)} \in \mathbb{R}^{11 \times 11}$. The
node features are set according to the ones in Figure 15.3, e.g., the feature values
for $v_1^{(2)}$ are 0.1 and 2.
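
The unrolled adjacency matrix of Example 15.5 can be built as in the following sketch. The edges among $v_1$, $v_2$, $v_3$ are assumed for illustration, temporal edges are treated as undirected and unweighted, and a temporal edge is added only when the entity exists at the earlier timestamp; the name `unroll` is hypothetical.

```python
import numpy as np

def unroll(node_sets, edge_sets, omega=1):
    """Build the static adjacency matrix of the temporally unrolled graph."""
    index = {}  # (entity, timestamp) -> row/column in the unrolled matrix
    for t, nodes in enumerate(node_sets, start=1):
        for v in nodes:
            index[(v, t)] = len(index)
    A = np.zeros((len(index), len(index)))
    for t, edges in enumerate(edge_sets, start=1):
        for u, v in edges:  # structural edges inside snapshot t
            a, b = index[(u, t)], index[(v, t)]
            A[a, b] = A[b, a] = 1.0
    for (v, t), a in index.items():  # temporal edges to the previous omega copies
        for t_prev in range(max(1, t - omega), t):
            if (v, t_prev) in index:
                b = index[(v, t_prev)]
                A[a, b] = A[b, a] = 1.0
    return A, index

base = [("v1", "v2"), ("v2", "v3")]  # assumed edges among v1, v2, v3
node_sets = [["v1", "v2", "v3"],
             ["v1", "v2", "v3", "v4"],
             ["v1", "v2", "v3", "v4"]]
edge_sets = [base,
             base + [("v2", "v4")],
             base + [("v2", "v4"), ("v3", "v4")]]

A_s, index = unroll(node_sets, edge_sets, omega=1)
print(A_s.shape)  # (11, 11)
```

The unrolled graph has 3 + 4 + 4 = 11 nodes, in line with Example 15.5; under the assumed edges it contains 9 structural and 7 temporal edges.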


15.4.2 Graph Neural Networks for DTDGs 


One natural way of developing models for DTDGs is by combining GNNs with 
sequence models; the GNN captures the information within the node connections 
and the sequence model captures the information within their evolution. A large 
number of the works on dynamic graphs in the literature follow this approach — see, 
e.g., (Seo et al, 2018; Manessi et al, 2020; Xu et al, 2019a). Here, we describe some 
generic ways of combining GNNs with sequence models. 

GNN-RNN: Let $\mathcal{G}$ be a DTDG with a sequence $[G^{(1)}, \ldots, G^{(\tau)}]$ of snapshots where
$G^{(t)} = (V^{(t)}, A^{(t)}, X^{(t)})$ for each $t \in \{1, \ldots, \tau\}$. Suppose we want to obtain node
embeddings at some time $t \le \tau$ based on the observations at or before $t$. For simplicity,
let us assume $V^{(1)} = V^{(2)} = \cdots = V^{(\tau)} = V$, i.e. the nodes are the same throughout
time (in cases where the nodes change, one may use a similar strategy as in
Example 15.4).

We can apply a GNN to each of the $G^{(t)}$s and obtain a hidden representation
matrix $Z^{(t)}$ whose rows correspond to node embeddings. Then, for the $i$-th node $v_i$, we
obtain a sequence of embeddings $[Z_i^{(1)}, Z_i^{(2)}, \ldots, Z_i^{(\tau)}]$. These embeddings do not
yet contain temporal information. To incorporate the temporal aspect of the DTDG
into the embeddings and obtain a temporal embedding for $v_i$ at time $t$, we can feed
the sequence $[Z_i^{(1)}, Z_i^{(2)}, \ldots, Z_i^{(t)}]$ into an RNN model as defined in equation 15.6 by
replacing $x^{(t)}$ with $Z_i^{(t)}$ and using the hidden representation of the RNN model as
the temporal node embedding for $v_i$. The temporal embedding for other nodes can be
obtained similarly by feeding their sequence of embeddings produced by the GNN
model to the same RNN model. The following formulae describe a variant of the
GNN-RNN model where the GNN is a GCN (defined in equation 15.1), the RNN is
an LSTM model, and the LSTM operations are applied to all node embeddings at
the same time (the formulae are applied sequentially for $t$ in $[1, 2, \ldots, \tau]$).


$$Z^{(t)} = \mathrm{GCN}(X^{(t)}, A^{(t)}) \tag{15.17}$$
$$I^{(t)} = \sigma(Z^{(t)} W^{(ii)} + H^{(t-1)} W^{(ih)} + b^{(i)}) \tag{15.18}$$
$$F^{(t)} = \sigma(Z^{(t)} W^{(fi)} + H^{(t-1)} W^{(fh)} + b^{(f)}) \tag{15.19}$$
$$C^{(t)} = F^{(t)} \odot C^{(t-1)} + I^{(t)} \odot \mathrm{Tanh}(Z^{(t)} W^{(ci)} + H^{(t-1)} W^{(ch)} + b^{(c)}) \tag{15.20}$$
$$O^{(t)} = \sigma(Z^{(t)} W^{(oi)} + H^{(t-1)} W^{(oh)} + b^{(o)}) \tag{15.21}$$
$$H^{(t)} = O^{(t)} \odot \mathrm{Tanh}(C^{(t)}) \tag{15.22}$$


where, similar to equations 15.7-15.11, $I^{(t)}$, $F^{(t)}$, and $O^{(t)}$ represent the input,
forget and output gates for the nodes respectively, $C^{(t)}$ is the memory cell, $H^{(t)}$ is the
hidden state corresponding to the node embeddings for the first $t$ observations, and the
$W^{(\cdot\cdot)}$s and $b^{(\cdot)}$s are weight matrices and vectors. In the above formulae, when we
add a matrix $Z^{(t)} W^{(\cdot\cdot)} + H^{(t-1)} W^{(\cdot\cdot)}$ to a bias vector $b^{(\cdot)}$, we assume the bias
vector $b^{(\cdot)}$ is added to every row of the matrix. $H^{(0)}$ and $C^{(0)}$ can be initialized with
zeros or learned from the data. $H^{(t)}$ corresponds to the temporal node embeddings
at time $t$ and can be used to make predictions about them. We can summarize the
equations above into:


$$Z^{(t)} = \mathrm{GCN}(X^{(t)}, A^{(t)}) \tag{15.23}$$
$$H^{(t)}, C^{(t)} = \mathrm{LSTM}(Z^{(t)}, H^{(t-1)}, C^{(t-1)}) \tag{15.24}$$


In a similar way, one can construct other variations of the GNN-RNN model such as
GCN-GRU, GAT-LSTM, GAT-RNN, etc. Figure 15.5 provides an overview of the
GCN-LSTM model.
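
A compact NumPy sketch of the GCN-LSTM encoder follows. It assumes a single-layer GCN with self-loops, symmetric degree normalization and a ReLU activation as a stand-in for equation 15.1, which is not reproduced in this section; all function names, parameter names and dimensions are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gcn(X, A, W):
    """One GCN layer: normalized adjacency with self-loops, then ReLU (assumed form)."""
    A_tilde = A + np.eye(A.shape[0])
    deg = A_tilde.sum(axis=1)
    A_hat = A_tilde / np.sqrt(np.outer(deg, deg))
    return np.maximum(A_hat @ X @ W, 0.0)

def lstm_step(Z, H_prev, C_prev, p):
    """LSTM update applied to all node embeddings at once; biases broadcast over rows."""
    I = sigmoid(Z @ p["W_ii"] + H_prev @ p["W_ih"] + p["b_i"])
    F = sigmoid(Z @ p["W_fi"] + H_prev @ p["W_fh"] + p["b_f"])
    C = F * C_prev + I * np.tanh(Z @ p["W_ci"] + H_prev @ p["W_ch"] + p["b_c"])
    O = sigmoid(Z @ p["W_oi"] + H_prev @ p["W_oh"] + p["b_o"])
    return O * np.tanh(C), C

def gcn_lstm(Xs, As, W_gcn, p, d_hid):
    n = Xs[0].shape[0]
    H, C = np.zeros((n, d_hid)), np.zeros((n, d_hid))  # zero initial states
    out = []
    for X, A in zip(Xs, As):
        Z = gcn(X, A, W_gcn)          # shared GCN parameters across snapshots
        H, C = lstm_step(Z, H, C, p)  # temporal update of every node
        out.append(H)
    return out

rng = np.random.default_rng(0)
n, d_in, d_gcn, d_hid, tau = 4, 3, 6, 5, 3
Xs = [rng.normal(size=(n, d_in)) for _ in range(tau)]
As = []
for _ in range(tau):
    U = np.triu((rng.random((n, n)) < 0.5).astype(float), k=1)
    As.append(U + U.T)                # random symmetric snapshot adjacency
W_gcn = rng.normal(scale=0.1, size=(d_in, d_gcn))
p = {}
for g in "ifco":
    p[f"W_{g}i"] = rng.normal(scale=0.1, size=(d_gcn, d_hid))
    p[f"W_{g}h"] = rng.normal(scale=0.1, size=(d_hid, d_hid))
    p[f"b_{g}"] = np.zeros(d_hid)

Hs = gcn_lstm(Xs, As, W_gcn, p, d_hid)
print(len(Hs), Hs[-1].shape)  # 3 (4, 5)
```

Swapping `gcn` for another static GNN layer, or `lstm_step` for a GRU update, gives the other GNN-RNN variants mentioned above.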

RNN-GNN: In cases where the graph structure is fixed through time (i.e.
$A^{(1)} = \cdots = A^{(\tau)} = A$) and only node features change, instead of first applying a GNN
model and then applying a sequence model to obtain temporal node embeddings,
one may apply the sequence model first to capture the temporal evolution of the
node features and then apply a GNN model to capture the correlations between the
nodes. We can create different variations of this generic model by using different
GNN and sequence models (e.g., LSTM-GCN, LSTM-GAT, GRU-GCN, etc.). The
formulation for an LSTM-GCN model is as follows:


$$H^{(t)}, C^{(t)} = \mathrm{LSTM}(X^{(t)}, H^{(t-1)}, C^{(t-1)}) \tag{15.25}$$
$$Z^{(t)} = \mathrm{GCN}(H^{(t)}, A) \tag{15.26}$$


with $Z^{(t)}$ containing the temporal node embeddings at time $t$. Note that RNN-GNN
is only appropriate if the adjacency matrix is fixed over time; otherwise, RNN-GNN
fails to capture the information within the evolution of the graph structure.

GNN-BiRNN and BiRNN-GNN: In the case of GNN-RNN and RNN-GNN,
the obtained node embeddings $H^{(t)}$ contain information about the observations at
or before time $t$. This is appropriate for extrapolation problems. For interpolation
problems (e.g., when we want to predict missing links between nodes at a timestamp
$t \le \tau$), however, we may want to use the observations before, at, or after time $t$. One
possible way of achieving this is by combining a GNN with a BiRNN so that the
BiRNN provides information from not only the observations at or before time $t$ but
also after time $t$.


Fig. 15.5: The GCN-LSTM model taking a sequence $G^{(1)}, G^{(2)}, \ldots, G^{(\tau)}$ as input
and producing hidden representations $H^{(1)}, H^{(2)}, \ldots, H^{(\tau)}$ as output. The
operations in LSTM Cells are described in equations 15.18-15.22. The GCN modules
have shared parameters.

GNN-Transformer: Combining GNNs with Transformers can be done in a similar
way as in GNN-RNNs. We apply a GNN to each of the $G^{(t)}$s and obtain a
hidden representation matrix $Z^{(t)}$ whose rows correspond to node embeddings.
Then for the $i$-th entity $v_i$, we create a matrix $H^{(0,i)}$ such that $H^{(0,i)}_t = Z^{(t)}_i + p^{(t)}$
(or $H^{(0,i)}_t = (Z^{(t)}_i \,\|\, p^{(t)})$), where $p^{(t)}$ is the positional encoding vector for position $t$.
That is, the $t$-th row of $H^{(0,i)}$ contains the embedding $Z^{(t)}_i$ of $v_i$ obtained by applying
the GNN model on $G^{(t)}$, plus the positional encoding. The 0 superscript in
$H^{(0,i)}$ shows that $H^{(0,i)}$ corresponds to the input of a Transformer model in the
0-th layer. Once we have $H^{(0,i)}$, we can apply an $L$-layer Transformer model (see
equations 15.2, 15.3 and 15.12) to obtain $H^{(L,i)}$, where $H^{(L,i)}_t$ corresponds to the
temporal embedding of $v_i$ at time $t$. For extrapolation, the matrix $A$ in equation 15.3
is a lower triangular matrix with $A_{i,j} = 1$ if $j \le i$ and 0 otherwise; for interpolation,
$A$ is a matrix of all 1s. The GCN-Transformer variant of the GNN-Transformer
model can be described using the following equations:


$$Z^{(t)} = \mathrm{GCN}(X^{(t)}, A^{(t)}) \quad \text{for } t \in \{1, 2, \ldots, \tau\} \tag{15.27}$$
$$H^{(0,i)}_t = Z^{(t)}_i + p^{(t)} \quad \text{for } t \in \{1, 2, \ldots, \tau\},\ i \in \{1, 2, \ldots, |V|\} \tag{15.28}$$
$$H^{(L,i)} = \mathrm{Transformer}(H^{(0,i)}, A) \quad \text{for } i \in \{1, 2, \ldots, |V|\} \tag{15.29}$$

GNN-CNN: In a similar way as GNN-RNN and GNN-Transformer, one can
combine GNNs with CNNs where the GNN provides $[Z^{(1)}, Z^{(2)}, \ldots, Z^{(\tau)}]$, then the
embeddings $[Z_i^{(1)}, Z_i^{(2)}, \ldots, Z_i^{(\tau)}]$ for each node $v_i$ are stacked into a matrix $H^{(0,i)}$
similar to the GNN-Transformer model, and then a 1D CNN model is applied on
$H^{(0,i)}$ (see Section 15.2.2) to provide the final node embeddings.

Creating Deeper Models: Consider the GCN-LSTM model in Figure 15.5. The
output of the GCN module is a sequence $[Z^{(1)}, Z^{(2)}, \ldots, Z^{(\tau)}]$ and the output of the
LSTM module is a sequence of hidden representation matrices $[H^{(1)}, H^{(2)}, \ldots, H^{(\tau)}]$.
Let us call the output of the GCN module $[Z^{(1,1)}, Z^{(1,2)}, \ldots, Z^{(1,\tau)}]$ and the
output of the LSTM module $[H^{(1,1)}, H^{(1,2)}, \ldots, H^{(1,\tau)}]$, where the added
superscript 1 indicates that these are the hidden representations created at layer 1.
One may consider each $H^{(1,t)}$ as the new node features for the nodes in $G^{(t)}$ and
run a GCN module (with separate parameters from the initial GCN) again to obtain
$[Z^{(2,1)}, Z^{(2,2)}, \ldots, Z^{(2,\tau)}]$. Then, another LSTM module may operate on these
matrices to produce $[H^{(2,1)}, H^{(2,2)}, \ldots, H^{(2,\tau)}]$. Stacking $L$ of these GCN-LSTM
blocks produces $[H^{(L,1)}, H^{(L,2)}, \ldots, H^{(L,\tau)}]$ as output. These hidden matrices can
then be used for making predictions about the nodes. The $l$-th layer of this model can
be formulated as below (the formulae are applied sequentially for $t$ in $[1, \ldots, \tau]$):


$$Z^{(l,t)} = \mathrm{GCN}(H^{(l-1,t)}, A^{(t)}) \tag{15.30}$$
$$H^{(l,t)}, C^{(l,t)} = \mathrm{LSTM}(Z^{(l,t)}, H^{(l,t-1)}, C^{(l,t-1)}) \tag{15.31}$$


where $H^{(0,t)} = X^{(t)}$ for $t \in \{1, \ldots, \tau\}$. The above two equations define what is
called a GCN-LSTM block. Other blocks can be constructed using similar combinations.


15.4.3 Graph Neural Networks for CTDGs 


Recently, developing models that operate on CTDGs without converting them to 
DTDGs (or converting them to static graphs) has been the subject of several studies. 
One class of models for CTDGs is based on extensions of the sequence models 
described in Section 15.2.2, especially RNNs. The general idea behind these models 
is to consume the observations sequentially and update the embedding of a node 
whenever a new observation is made about that node (or, in some works, about one 
of its neighbors). Before describing GNN-based approaches for CTDGs, we briefly 
describe some of the RNN-based models for CTDGs. 

Consider a CTDG with $G^{(t_0)} = (V^{(t_0)}, A^{(t_0)}, X^{(t_0)})$ with $A^{(t_0)}_{i,j} = 0$ for all $i, j$ (i.e.
no initial edges) and observations $O$ whose only type is edge additions. Since the
only observation types are edge additions, for this CTDG, the nodes and their features
are fixed over time. Let $Z^{(t^-)}$ represent the node embeddings right before time
$t$ (initially, $Z^{(t_0)} = X^{(t_0)}$ or $Z^{(t_0)} = X^{(t_0)} W$ where $W$ is a weight matrix with learnable
parameters). Upon making an observation $(\text{AddEdge}, (v_i, v_j), t)$ corresponding
to a new directed edge between two nodes $v_i, v_j \in V$, the model developed in (Kumar
et al, 2019b) updates the embeddings for $v_i$ and $v_j$ as follows:


$$Z^{(t)}_i = \text{RNN}_{\text{source}}\big((Z^{(t^-)}_j \,\|\, \Delta t_i \,\|\, f),\ Z^{(t^-)}_i\big) \tag{15.32}$$
$$Z^{(t)}_j = \text{RNN}_{\text{target}}\big((Z^{(t^-)}_i \,\|\, \Delta t_j \,\|\, f),\ Z^{(t^-)}_j\big) \tag{15.33}$$

where $\text{RNN}_{\text{source}}$ and $\text{RNN}_{\text{target}}$ are two RNNs with different weights³, $\Delta t_i$ and $\Delta t_j$
represent the time elapsed since $v_i$'s and $v_j$'s previous interactions respectively⁴, $f$
represents a vector of features corresponding to edge features (if any), $\|$ indicates
concatenation, and $Z^{(t)}_i$ and $Z^{(t)}_j$ represent the updated embeddings at time $t$. The
first RNN takes as input a new observation $(Z^{(t^-)}_j \,\|\, \Delta t_i \,\|\, f)$ and the previous
hidden state of a node $Z^{(t^-)}_i$ and provides an updated representation (similarly for
the second RNN). Besides learning a temporal embedding $Z^{(t)}$ as described above,
in (Kumar et al, 2019b) another embedding vector is also learned for each entity 
that is fixed over time and captures the static features of the nodes. The two embed- 
dings are then concatenated to produce the final embedding that is used for making 
predictions. 
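The update rule in equations 15.32 and 15.33 can be sketched as follows. This is only an illustration under stated assumptions: a vanilla tanh RNN cell stands in for the RNNs of the original model, the static embedding and the prediction head are omitted, and the names `SimpleRNNCell`, `process_add_edge`, and `last_time` are hypothetical.

```python
import numpy as np

class SimpleRNNCell:
    """A vanilla RNN cell: h_new = tanh(x W_x + h W_h + b) (assumed cell type)."""
    def __init__(self, input_dim, hidden_dim, rng):
        self.W_x = rng.normal(scale=0.1, size=(input_dim, hidden_dim))
        self.W_h = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
        self.b = np.zeros(hidden_dim)

    def __call__(self, x, h):
        return np.tanh(x @ self.W_x + h @ self.W_h + self.b)

def process_add_edge(Z, last_time, i, j, t, f, rnn_source, rnn_target):
    """Handle (AddEdge, (v_i, v_j), t): update rows i and j of Z in place."""
    dt_i, dt_j = t - last_time[i], t - last_time[j]   # time since previous interaction
    z_i, z_j = Z[i].copy(), Z[j].copy()               # embeddings right before time t
    Z[i] = rnn_source(np.concatenate([z_j, [dt_i], f]), z_i)   # eq. 15.32
    Z[j] = rnn_target(np.concatenate([z_i, [dt_j], f]), z_j)   # eq. 15.33
    last_time[i] = last_time[j] = t

rng = np.random.default_rng(0)
n, d, f_dim = 5, 4, 2
Z = rng.normal(size=(n, d))        # stands in for Z^(t_0), e.g. X^(t_0) W
Z_initial = Z.copy()
last_time = np.zeros(n)            # t_0 = 0 is used for a node's first interaction
rnn_source = SimpleRNNCell(d + 1 + f_dim, d, rng)
rnn_target = SimpleRNNCell(d + 1 + f_dim, d, rng)

# A toy stream of edge additions (source, target, time), consumed in time order.
events = [(0, 1, 0.5), (1, 2, 1.2), (0, 1, 3.0)]
for i, j, t in events:
    edge_features = rng.normal(size=f_dim)
    process_add_edge(Z, last_time, i, j, t, edge_features, rnn_source, rnn_target)
```

Only the two endpoints of each observed edge are touched, so nodes 3 and 4 keep their initial embeddings in this toy stream.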

In Trivedi et al (2017), a similar strategy is followed to develop a model for
CTDGs with multi-relational graphs in which two custom RNNs update the node
embeddings for the source and target nodes once a new labeled edge is observed
between them. In Trivedi et al (2019), a model is developed that is similar to
the above models but closer in nature to GNNs. Upon making an observation
$(\text{AddEdge}, (v_i, v_j), t)$, the node embedding for $v_i$ is updated as follows (and
similarly for $v_j$):

$$Z^{(t)}_i = \text{RNN}\big((z_{\mathcal{N}(v_j)} \,\|\, \Delta t_i),\ Z^{(t^-)}_i\big) \tag{15.34}$$

where $z_{\mathcal{N}(v_j)}$ is an embedding that is computed based on a custom attention-weighted
aggregation of the embeddings of $v_j$ and its neighbors at time $t$, and $\Delta t_i$ is
defined similarly as in equation 15.32. Unlike equation 15.32 where the RNN updates
the embedding of $v_i$ based on the embedding of $v_j$ alone, in equation 15.34
the embedding of $v_i$ is updated based on an aggregation of the embeddings from the
first-order neighborhood of $v_j$ which makes it close in nature to GNNs.

Many of the existing RNN-based approaches for CTDGs only compute the node 
embeddings based on their immediate neighboring nodes (or nodes that are 1-hop 


3 The reason for using two RNNs is to allow the source and target nodes of a directed graph to be
updated differently upon making the observation $(\text{AddEdge}, (v_i, v_j), t)$. If the graph is undirected,
one may use a single RNN.


4 If this is the first interaction of $v_i$ (or $v_j$), then $\Delta t_i$ (or $\Delta t_j$) can be the time elapsed since $t_0$.
away from them) and do not take into account the nodes that are multi-hops away.
We now describe a GNN-based model for CTDGs named temporal graph attention 
networks (TGAT) and developed in (Xu et al, 2020a) that computes node embed- 
dings based on the k-hop neighborhood of the nodes (i.e. based on the nodes that 
are at most k hops away). Being a GNN-based model, TGAT can learn embeddings 
for new nodes that are added to a graph and can be used for inductive settings where 
at the test time, predictions are to be made for previously unseen nodes. 

Similar to the Transformer model, TGAT removes the recurrence and instead
relies on self-attention and an extension of positional encoding to continuous time
encoding named Time2Vec. In Time2Vec (Kazemi et al, 2019), time $t$ (or a delta of
time as in equation 15.32 and equation 15.34) is represented as a vector $z^{(t)}$ defined
as follows:

$$z^{(t)}_i = \begin{cases} \omega_i t + \varphi_i, & \text{if } i = 0, \\ \sin(\omega_i t + \varphi_i), & \text{if } 1 \le i \le k, \end{cases} \tag{15.35}$$

where $\omega$ and $\varphi$ are vectors with learnable parameters. TGAT uses a specific case of
Time2Vec where the linear term is removed and the parameters $\varphi$ are fixed to 0s and
$\pi/2$s similar to equation 15.13. We refer the reader to Kazemi et al (2019); Xu et al
(2020a) for theoretical and practical motivations of such a time encoding.
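As a small illustration of equation 15.35, the sketch below evaluates Time2Vec for given parameters and also the TGAT-style special case. The parameters are fixed numbers here rather than learned, and pairing each frequency with both a 0 and a $\pi/2$ phase (giving sine and cosine features) is an assumption about how the fixed phases are arranged; the function names are hypothetical.

```python
import numpy as np

def time2vec(t, omega, phi):
    """Time2Vec (eq. 15.35): component 0 is linear, components 1..k are sinusoids."""
    z = np.sin(omega * t + phi)
    z[0] = omega[0] * t + phi[0]
    return z

def tgat_time_encoding(t, omega):
    """Special case without the linear term and with phases fixed to 0 and pi/2.

    Since sin(x + pi/2) = cos(x), this yields sine and cosine features.
    """
    return np.concatenate([np.sin(omega * t), np.sin(omega * t + np.pi / 2)])

omega = np.array([0.5, 1.0, 2.0])          # frequencies (learnable in practice)
phi = np.array([0.1, 0.0, np.pi / 2])      # phases (learnable in Time2Vec)
z = time2vec(2.0, omega, phi)              # [0.5*2 + 0.1, sin(2), sin(4 + pi/2)]
enc = tgat_time_encoding(2.0, omega)       # [sin(w t)..., cos(w t)...]
```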

Now we describe how TGAT computes node embeddings. For a node $v_i$ and
timestamp $t$, let $\mathcal{N}^{(t)}_i$ represent the set of nodes that interacted with $v_i$ at or before
time $t$ and the timestamps for the interactions. Each element of $\mathcal{N}^{(t)}_i$ is of the form
$(v_j, t_k)$ where $t_k < t$. The $l$-th layer of TGAT computes the embedding $h^{(t,l,i)}$ for $v_i$ at
time $t$ in layer $l$ using the following steps⁵:

1. For any node $v_i$, $h^{(t,0,i)}$ (corresponding to the embedding of $v_i$ in the 0-th layer at
time $t$) is assumed to be equal to $X_i$ for any value of $t$.

2. A matrix $K^{(t,l,i)}$ with $|\mathcal{N}^{(t)}_i|$ rows is created such that for each $(v_j, t_k) \in \mathcal{N}^{(t)}_i$,
$K^{(t,l,i)}$ has a row $(h^{(t_k,l-1,j)} \,\|\, z^{(t-t_k)})$ where $h^{(t_k,l-1,j)}$ corresponds to the embedding
of $v_j$ in layer $(l-1)$ at the time $t_k$ of its interaction with $v_i$ and $z^{(t-t_k)}$
is an encoding for the delta time $(t - t_k)$ as in equation 15.35. Note that each
$h^{(t_k,l-1,j)}$ is computed recursively using the same steps outlined here.

3. A vector $q^{(t,l,i)}$ is computed as $(h^{(t,l-1,i)} \,\|\, z^{(0)})$ where $h^{(t,l-1,i)}$ is the embedding
of $v_i$ at time $t$ in layer $(l-1)$ and $z^{(0)}$ is an encoding for a delta of time equal
to 0 as in equation 15.35.

4. $q^{(t,l,i)}$ is used to determine how much $v_i$ should attend to each row of $K^{(t,l,i)}$
corresponding to the representation of its neighbors. Attention weights $a^{(t,l,i)}$
are computed using equation 15.12 where the $j$-th element of $a^{(t,l,i)}$ is computed
as $a^{(t,l,i)}_j = \alpha(q^{(t,l,i)}, K^{(t,l,i)}_j; \theta^{(l)})$.

5. Having the attention weights, a representation $\tilde{h}^{(t,l,i)}$ is computed for $v_i$ using
equation 15.2 where the attention matrix $A^{(l)}$ is replaced with the attention
vector $a^{(t,l,i)}$.

6. Finally, $h^{(t,l,i)} = \text{FF}^{(l)}(h^{(t,l-1,i)} \,\|\, \tilde{h}^{(t,l,i)})$ computes the representation for node $v_i$
at time $t$ in layer $l$ where $\text{FF}^{(l)}$ is a feed-forward neural network in layer $l$.


5 For simplicity, here we describe a single-head attention-based GNN version of TGAT; in the
original work, a multi-head version is used (see equation 15.5 for details.)


An L-layer TGAT model computes node embeddings based on the L-hop neigh- 
borhood of a node. 
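The recursion in the six steps above can be sketched in a few lines. Since equations 15.12 and 15.2 are not repeated here, the sketch assumes scaled dot-product attention with separate query, key, and value projections and a two-layer ReLU feed-forward network; these choices, the single-head form, and every name in the code (`tgat_embedding`, `neighbors_before`, `init_params`) are illustrative assumptions rather than the original implementation.

```python
import numpy as np

def time_encoding(dt, omega):
    """Sine/cosine encoding of a time delta (assumed TGAT-style special case)."""
    return np.concatenate([np.sin(omega * dt), np.cos(omega * dt)])

def neighbors_before(events, i, t):
    """The set N_i^(t): pairs (v_j, t_k) with t_k < t for interactions of v_i."""
    out = []
    for u, v, tk in events:
        if tk < t and i in (u, v):
            out.append((v if u == i else u, tk))
    return out

def tgat_embedding(i, t, l, X, events, params, omega):
    """Recursively compute h^(t,l,i)."""
    if l == 0:
        return X[i]                                               # step 1
    P = params[l - 1]
    h_prev = tgat_embedding(i, t, l - 1, X, events, params, omega)
    q = np.concatenate([h_prev, time_encoding(0.0, omega)])       # step 3
    nbrs = neighbors_before(events, i, t)
    if nbrs:
        K = np.stack([                                            # step 2
            np.concatenate([tgat_embedding(j, tk, l - 1, X, events, params, omega),
                            time_encoding(t - tk, omega)])
            for j, tk in nbrs])
        scores = (K @ P["W_k"]) @ (q @ P["W_q"]) / np.sqrt(P["W_q"].shape[1])
        a = np.exp(scores - scores.max())
        a /= a.sum()                                              # step 4
        h_tilde = a @ (K @ P["W_v"])                              # step 5
    else:
        h_tilde = np.zeros(P["W_v"].shape[1])                     # no neighbors yet
    hidden = np.maximum(np.concatenate([h_prev, h_tilde]) @ P["W_1"], 0.0)
    return hidden @ P["W_2"]                                      # step 6

def init_params(num_layers, d, te, rng):
    """Random weights; every layer keeps the embedding size d for simplicity."""
    shapes = {"W_q": (d + te, d), "W_k": (d + te, d), "W_v": (d + te, d),
              "W_1": (2 * d, d), "W_2": (d, d)}
    return [{k: rng.normal(scale=0.3, size=s) for k, s in shapes.items()}
            for _ in range(num_layers)]

rng = np.random.default_rng(0)
d, omega = 3, np.array([1.0, 0.1])
X = rng.normal(size=(4, d))
params = init_params(2, d, 2 * len(omega), rng)

# v_0 interacted with v_1 at time 2.0, and v_1 with v_2 at time 1.0.
events = [(0, 1, 2.0), (1, 2, 1.0)]
h = tgat_embedding(0, 3.0, 2, X, events, params, omega)
# An earlier interaction of v_2 lies three hops away and is never visited.
h_more = tgat_embedding(0, 3.0, 2, X, events + [(2, 3, 0.5)], params, omega)
```

With two layers, the earlier interaction between v_2 and v_3 does not change the embedding of v_0, which illustrates the L-hop receptive field.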

Suppose we run a 2-layer TGAT model on a temporal graph where $v_i$ interacted
with $v_j$ at time $t_1 < t$ and $v_j$ interacted with $v_k$ at time $t_2 < t_1$. The embedding $h^{(t,2,i)}$
is computed based on the embedding $h^{(t_1,1,j)}$ which is itself computed based on the
embedding $h^{(t_2,0,k)}$. Since we are now at the 0-th layer, $h^{(t_2,0,k)}$ in TGAT is approximated
with $X_k$ thus ignoring the interactions $v_k$ has had before time $t_2$. This may be
suboptimal if $v_k$ has had important interactions before $t_2$ as these interactions are not
reflected on $h^{(t_1,1,j)}$ and hence not reflected on $h^{(t,2,i)}$. In (Rossi et al, 2020), this
problem is remedied by using a recurrent model (similar to those introduced at the
beginning of this subsection) that provides node embeddings at any time based on
their previous local interactions, and initializing $h^{(t,0,i)}$ with these embeddings.


15.5 Applications 


In this section, we provide some examples of real-world problems that have been
formulated as predictions over dynamic graphs and modeled using GNNs. In particular,
we review applications in computer vision, traffic forecasting, and knowledge
graphs. This is by no means a comprehensive list; other application domains include
recommendation systems (Song et al, 2019a), physical simulation of object trajectories
(Kipf et al, 2018), social network analysis (Min et al, 2021), automated software
bug triaging (Wu et al, 2021a), and many more.


15.5.1 Skeleton-based Human Activity Recognition 


Human activity recognition from videos is a well-studied problem in computer vi- 
sion with several applications. Given a video of a human, the goal is to classify 
the activity performed by the human in the video into a pre-defined set of classes 
such as walking, running, dancing, etc. One possible approach for this problem is 
to make predictions based on the human body skeleton as the skeleton conveys im- 
portant information for human action recognition. In this subsection, we provide a 
dynamic graph formulation of this problem and a modeling approach based mainly 
on (a simplified version of) the approach of (Yan et al, 2018a). 

Let us begin with formulating the skeleton-based activity recognition problem as 
reasoning over a dynamic graph. A video is a sequence of frames and each frame 
can be converted into a set of n nodes corresponding to the key points in the skeleton 
using computer vision techniques (see, e.g., (Cao et al, 2017)). These n nodes each 
have a feature vector representing their (2D or 3D) coordinates in the image frame. 
The human body specifies how these key points are connected to each other. With 
Fig. 15.6: The human skeleton represented as a graph for each snapshot of a video.
The nodes represent the key points and the edges represent connections between
these key points. The $t$-th graph corresponds to the human skeleton obtained from the
$t$-th frame of a video.


this description, we can formulate the problem as reasoning over a DTDG consisting
of a sequence $[G^{(1)}, G^{(2)}, \dots, G^{(\tau)}]$ of graphs where each $G^{(t)} = (V^{(t)}, A^{(t)}, X^{(t)})$
corresponds to the $t$-th frame of a video with $V^{(t)}$ representing the set of key points in
the $t$-th frame, $A^{(t)}$ representing their connections, and $X^{(t)}$ representing their features.
An example is provided in Figure 15.6. One may notice that $V^{(1)} = \dots = V^{(\tau)} = V$
and $A^{(1)} = \dots = A^{(\tau)} = A$, i.e. the nodes and the adjacency matrices remain fixed
throughout the sequence because they correspond to the key points and how they
are connected in the human body. For instance, in the graphs of Figure 15.6, the
node numbered as 3 is always connected to the nodes numbered as 2 and 4. The
feature matrices $X^{(t)}$, however, keep changing as the coordinates of the key points
change in different frames. The activity recognition can now be cast as classifying
a dynamic graph into a set of predefined classes $C$.

The approach employed in (Yan et al, 2018a) is to convert the above DTDG into
a static graph through temporal unrolling (see Section 15.4.1). In the static graph,
the node corresponding to a key point at time $t$ is connected to other key points at
time $t$ according to the human body (or, in other words, according to $A$) as well
as the nodes representing the same key point and its neighbors in the previous $\omega$
timestamps. Once a static graph is constructed, a GNN can be applied to obtain
embeddings for every joint at every timestamp. Since activity recognition corresponds
to graph classification in this formulation, the decoder may consist of a (max, mean,
or another type of) pooling layer on the node embeddings to obtain a graph
embedding followed by a feed-forward network and a softmax layer to make class
predictions.
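The unrolling and the pooling decoder described above can be sketched as follows. This is a simplified illustration, not the architecture of the cited work: the static graph is treated as undirected, a single mean-aggregation layer stands in for the GNN, the feed-forward network is reduced to one linear map, and the names `unroll_skeleton`, `classify`, and `window` (the number of previous timestamps) are hypothetical.

```python
import numpy as np

def unroll_skeleton(A, tau, window):
    """Static adjacency over tau * n nodes; node (t, i) has index t * n + i."""
    n = A.shape[0]
    temporal = A + np.eye(n)          # same key point plus its body neighbors
    S = np.zeros((tau * n, tau * n))
    for t in range(tau):
        S[t * n:(t + 1) * n, t * n:(t + 1) * n] = A            # edges within frame t
        for s in range(max(0, t - window), t):                 # previous frames
            S[t * n:(t + 1) * n, s * n:(s + 1) * n] = temporal
            S[s * n:(s + 1) * n, t * n:(t + 1) * n] = temporal.T
    return S

def classify(S, X_all, W, W_out):
    """One mean-aggregation GNN layer, mean pooling, linear layer, softmax."""
    A_hat = S + np.eye(len(S))
    H = np.maximum((A_hat / A_hat.sum(axis=1, keepdims=True)) @ X_all @ W, 0.0)
    logits = H.mean(axis=0) @ W_out       # graph embedding -> class scores
    p = np.exp(logits - logits.max())
    return p / p.sum()

# A 3-joint chain 0-1-2 observed over 3 frames, linked to 1 previous frame.
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
tau, window, n = 3, 1, 3
S = unroll_skeleton(A, tau, window)

rng = np.random.default_rng(0)
X_all = rng.normal(size=(tau * n, 2))     # stacked 2D coordinates of all frames
probs = classify(S, X_all, rng.normal(size=(2, 4)), rng.normal(size=(4, 5)))
```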

In the $l$-th layer of the GNN in (Yan et al, 2018a), the adjacency matrix is multiplied
element-wise by a mask matrix $M^{(l)}$ with learnable parameters (i.e. $A \odot M^{(l)}$ is
used as the adjacency matrix). $M^{(l)}$ can be considered a data-independent attention
map that learns weights for the edges in $A$. The goal of $M^{(l)}$ is to learn which
connections are more important for activity recognition. Multiplying by $M^{(l)}$ only
allows for changing the weight of the edges in $A$ but it cannot add new edges.
Connecting the key points according to the human body may arguably not be the
best choice as, e.g., the connection between the hands is important in recognizing
the clapping activity. In (Li et al, 2019e), the adjacency is summed with two other
matrices $B^{(l)}$ and $C^{(l)}$ (i.e. $A + B^{(l)} + C^{(l)}$ is used as the adjacency) where $B^{(l)}$
is a data-independent attention matrix similar to $M^{(l)}$ and $C^{(l)}$ is a data-dependent
attention matrix. Adding two matrices $B^{(l)}$ and $C^{(l)}$ to $A$ allows for not only changing
the edge weights in $A$ but also adding new edges.
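The difference between the multiplicative mask and the additive matrices can be checked numerically. In the sketch below the matrices are random stand-ins for learned parameters, and computing the data-dependent matrix as a row-wise softmax over feature similarities is an assumption made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Skeleton chain 0-1-2: joints 0 and 2 (say, the two hands) are not connected.
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)

# Multiplicative mask: rescales existing edges, zero entries stay zero.
M = rng.uniform(0.5, 1.5, size=A.shape)
masked = A * M

# Additive scheme: B is data-independent, C depends on the node features X.
B = rng.uniform(0.0, 0.1, size=A.shape)
X = rng.normal(size=(3, 2))
sim = X @ X.T
C = np.exp(sim - sim.max(axis=1, keepdims=True))
C /= C.sum(axis=1, keepdims=True)        # assumed form: row-wise softmax
summed = A + B + C

print(masked[0, 2], summed[0, 2] > 0)    # the mask cannot create the 0-2 edge
```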

Instead of converting the dynamic graph to a static graph through temporal
unrolling and applying a GNN on the static graph as in the previous two works, in Shi
et al (2019b) a GNN-CNN model is used (among other changes). One can use
other combinations of a GNN and a sequence model (e.g., GNN-RNN) to obtain
embeddings for joints at different timestamps. Note that activity recognition is not
an extrapolation problem (i.e. the goal is not to predict the future based on the past).
Therefore, to obtain the joint embeddings at time $t$, one may use information not
only from $G^{(t')}$ where $t' \le t$ but also from timestamps $t' > t$. This can be done by
using, e.g., a GNN-BiRNN model (see Section 15.4.2).


15.5.2 Traffic Forecasting 


For urban traffic control, traffic forecasting plays a paramount role. To predict the
future traffic of a road, one needs to consider two important factors: spatial dependence
and temporal dependence. Traffic on different roads is spatially dependent
as future traffic on one road depends on the traffic on the roads
that are connected to it. The spatial dependence is a function of the topology of the
road network. There is also temporal dependence for each road because the traffic
volume on a road at any time depends on the traffic volume at the previous times.
There are also periodic patterns as, e.g., the traffic on a road may be similar at the
same times of the day or at the same times of the week.

Early approaches for traffic forecasting mainly focused on temporal dependen- 
cies and ignored the spatial dependencies (Fu et al, 2016). Later approaches aimed 
at capturing spatial dependencies using convolutional neural networks (CNNs) (Yu 
et al, 2017b), but CNNs are typically restricted to grid structures. To enable captur- 
ing both spatial and temporal dependencies, several recent works have formulated 
traffic forecasting as reasoning over a dynamic graph (DTDGs in particular). 

We start by formulating traffic forecasting as a reasoning problem over a
dynamic graph. One possible formulation is to consider a node for each road seg- 
ment and connect two nodes if their corresponding road segments intersect with 
each other. The node features are the traffic flow variables (e.g., speed, volume, and 
density). The edges can be directed, e.g., to show the flow of the traffic in one-way 
roads, or undirected, showing that traffic flows in both directions. The structure of 
the graph can also change over time as, e.g., some road segments or some intersec- 
tions may get (temporarily) closed. One may record the traffic flow variables and 
the state of the roads and intersections at regularly-spaced time intervals resulting 
in a DTDG. Alternatively, one may record the variables at different (asynchronous) 
time intervals resulting in a CTDG. The prediction problem is a node regression
problem as we need to predict the traffic flow for the nodes, and it is an extrapolation
problem as we need to predict the future state of the flow. The problem can
be studied under a transductive setting where a model is trained based on the traffic 
data in a region and tested for making predictions about the same region. It can also 
be studied under an inductive setting where a model is trained based on the traffic 
data in multiple regions and is tested on new regions. 

In (Zhao et al, 2019c), a model is proposed for transductive traffic forecasting
in which the problem is formulated as reasoning over a DTDG with a sequence
$[G^{(1)}, G^{(2)}, \dots, G^{(\tau)}]$ of snapshots. The graph structure is considered to be fixed (i.e.
no changes in road or intersection conditions) but the node features, corresponding
to traffic flow features, change over time. The proposed model is a GCN-GRU model
(see Section 15.4.2) where the GCN captures the spatial dependencies and the GRU
captures the temporal dependencies. At any time $t$, the model provides a hidden
representation matrix $H^{(t)}$ based on the information at or before $t$; the rows of this
matrix correspond to the node embeddings. These embeddings can then be used to
make predictions about the traffic flow in the next timestamp(s). Assuming $\hat{Y}^{(t+1)}$
represents the predictions for the next timestamp and $Y^{(t+1)}$ represents the ground
truth, the model is trained by minimizing an L2-regularized sum of the absolute
errors $\| \hat{Y}^{(t+1)} - Y^{(t+1)} \|$.
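As a worked instance of this training objective, the sketch below adds an L2 penalty on the model weights to the summed absolute prediction errors. Reading the norm as an entry-wise sum of absolute values is an assumption, and the function name `traffic_loss` and the regularization strength `lam` are hypothetical.

```python
import numpy as np

def traffic_loss(Y_hat, Y, weights, lam):
    """Sum of absolute errors plus lam times the squared L2 norm of all weights."""
    abs_err = np.abs(Y_hat - Y).sum()
    l2 = sum((W ** 2).sum() for W in weights)
    return abs_err + lam * l2

# Two road segments (rows) with two traffic variables each (columns).
Y_hat = np.array([[1.0, 2.0], [3.0, 4.0]])   # predictions for timestamp t + 1
Y = np.array([[1.0, 1.0], [5.0, 4.0]])       # ground truth for timestamp t + 1
weights = [np.ones((2, 2))]                  # stand-in for the model parameters

# Absolute errors: 0 + 1 + 2 + 0 = 3; L2 term: 0.5 * 4 = 2.
loss = traffic_loss(Y_hat, Y, weights, lam=0.5)
```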

As explained in Section 15.2.2, RNN-based models (e.g., the GCN-GRU model 
above) typically require sequential computations and are not amenable to parallelization.
In (Yu et al, 2018a), the temporal dependencies are captured using CNNs
instead of RNNs. The proposed model contains multiple blocks of CNN-GNN-CNN 
where the GNN is a generalization of GCNs to multi-dimensional tensors and the 
CNNs are gated. 

The two works described so far consider the adjacency matrix to be fixed in dif- 
ferent timestamps. As explained earlier, however, the adjacency matrix may change 
over time, e.g., due to accidents and roadblocks. In (Diao et al, 2019), the change 
in the adjacency matrix is taken into account through estimating the change in the 
topology of the roads based on the short-term traffic data. 


15.5.3 Temporal Knowledge Graph Completion 


Knowledge graphs (KGs) are databases of facts. A KG contains a set of facts in the
form of triples $(v_i, r_j, v_k)$ where $v_i$ and $v_k$ are called the subject and object entities
and $r_j$ is a relation. A KG can be viewed as a directed multi-relational graph with
nodes $V = \{v_1, \dots, v_n\}$, relations $R = \{r_1, \dots, r_m\}$, and $m$ adjacency matrices where
the $j$-th adjacency matrix corresponds to the relations of type $r_j$ between the nodes
according to the triples.
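The view of a KG as a stack of adjacency matrices can be illustrated with a toy example. The entities, relations, and facts below are made up for illustration, and placing the subject on the row axis and the object on the column axis is one possible convention.

```python
import numpy as np

def kg_to_adjacency(triples, n, m):
    """Return an (m, n, n) array; A[j, i, k] = 1 iff (v_i, r_j, v_k) is a fact."""
    A = np.zeros((m, n, n))
    for i, j, k in triples:
        A[j, i, k] = 1.0
    return A

entities = ["alice", "bob", "paris"]          # hypothetical entities v_0..v_2
relations = ["knows", "lives_in"]             # hypothetical relations r_0, r_1
triples = [(0, 0, 1), (0, 1, 2), (1, 1, 2)]   # (subject, relation, object) indices
A = kg_to_adjacency(triples, n=len(entities), m=len(relations))
```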

A temporal knowledge graph (TKG) contains a set of temporal facts. Each fact 
may be associated with a single timestamp indicating the time when the event spec- 
ified by the fact occurred, or a time interval indicating the start and end timestamps. 
The facts with a single timestamp typically represent communication events and the
facts with a time interval typically represent associative events (see Section 15.3.2)⁶.
Here, we focus on facts with a single timestamp for which a TKG can be defined as a
set of quadruples of the form $(v_i, r_j, v_k, t)$ where $t$ indicates the time when $(v_i, r_j, v_k)$
occurred. Depending on the granularity of the timestamps, one may think of a TKG
as a DTDG or a CTDG.

TKG completion is the problem of learning models based on the existing temporal
facts in a TKG to answer queries of the type $(v_i, r_j, ?, t)$ (or $(?, r_j, v_k, t)$) where the
correct answer is an entity $v \in V$ such that $(v_i, r_j, v, t)$ (or $(v, r_j, v_k, t)$) has not been
observed during training. It is mainly an interpolation problem as queries are to be
answered at a timestamp $t$ based on the past, present, and future facts. Currently, the
majority of the models for TKG completion are not based on GNNs (e.g., see (Goel
et al, 2020; García-Durán et al, 2018; Dasgupta et al, 2018; Lacroix et al, 2020)).
Here, we describe a GNN-based approach that is mainly based on the work in (Wu
et al, 2020b).

Since TKGs correspond to multi-relational graphs, to develop a GNN-based
model that operates on a TKG we first need a relational GNN. Here, we describe
a model named relational graph convolution network (RGCN) (Schlichtkrull et al,
2018) but other relational GNN models can also be used (see, e.g., (Vashishth et al,
2020)). Whereas GCN projects all neighbors of a node using the same weight matrix
(see Section 15.2.1), RGCN applies relation-specific projections. Let $\hat{R}$ be a
set of relations that includes every relation in $R = \{r_1, \dots, r_m\}$ as well as a self-loop
relation $r_0$ where each node has the relation $r_0$ only with itself. As is common in
directed graphs (see, e.g., (Marcheggiani and Titov, 2017)) and especially for
multi-relational graphs (see, e.g., (Kazemi and Poole, 2018)), for each relation $r_j \in R$ we
also add an auxiliary relation $r_j^{-1}$ to $\hat{R}$ where $v_i$ has relation $r_j^{-1}$ with $v_k$ if and only
if $v_k$ has relation $r_j$ with $v_i$. The $l$-th layer of an RGCN model can then be described as
follows:


$$Z^{(l)} = \sigma\Big(\sum_{r \in \hat{R}} {D^{(r)}}^{-1} A^{(r)} Z^{(l-1)} W^{(l,r)}\Big) \tag{15.36}$$

where $A^{(r)} \in \mathbb{R}^{n \times n}$ represents the adjacency matrix corresponding to relation $r$, $D^{(r)}$
is the degree matrix of $A^{(r)}$ with $D^{(r)}_{i,i}$ representing the number of incoming relations
of type $r$ for the $i$-th node, ${D^{(r)}}^{-1}$ is a normalization term⁷, $W^{(l,r)}$ is a relation-specific
weight matrix for layer $l$, $Z^{(l-1)}$ represents the node embeddings in the $(l-1)$-th layer,
and $Z^{(l)}$ represents the updated node embeddings in the $l$-th layer. If initial features
$X$ are provided as input, $Z^{(0)}$ can be set to $X$. Otherwise, $Z^{(0)}$ can either be set as
1-hot encodings where $Z^{(0)}_i$ is a vector whose elements are all zeros except in the
$i$-th position where it is 1, or it can be randomly initialized and then learned from the
data.


6 This, however, is not always true as one may break a fact such as $(v_i, \text{LivedIn}, v_j)$ with a time
interval [2010, 2015] (meaning from 2010 until 2015) into a fact $(v_i, \text{StartedLivingIn}, v_j)$ with a
timestamp of 2010 and another fact $(v_i, \text{EndedLivingIn}, v_j)$ with a timestamp of 2015.


7 One needs to handle the cases where $D^{(r)}_{i,i} = 0$ to avoid numerical issues.
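A single RGCN layer as in equation 15.36 can be sketched with NumPy as below. The sketch assumes ReLU for the nonlinearity, indexes the extended relation set as self-loop, original relations, then inverse relations, and treats a zero degree by contributing nothing for that relation; the function name `rgcn_layer` and the toy triples are hypothetical.

```python
import numpy as np

def rgcn_layer(Z, triples, n, m, W):
    """One RGCN layer.

    W has shape (2m + 1, d_in, d_out): index 0 is the self-loop relation r_0,
    indices 1..m are the original relations, m+1..2m their inverses.
    Row i of each adjacency matrix lists the nodes that send messages to v_i.
    """
    A = np.zeros((2 * m + 1, n, n))
    A[0] = np.eye(n)                       # self-loop relation r_0
    for s, r, o in triples:                # fact (v_s, r, v_o)
        A[1 + r, o, s] = 1.0               # v_o has an incoming relation r from v_s
        A[1 + m + r, s, o] = 1.0           # v_s has an incoming relation r^{-1} from v_o
    out = np.zeros((n, W.shape[2]))
    for r in range(2 * m + 1):
        deg = A[r].sum(axis=1)             # D^(r)_{i,i}: incoming edges of type r
        inv = np.divide(1.0, deg, out=np.zeros_like(deg), where=deg > 0)
        out += (inv[:, None] * A[r]) @ Z @ W[r]
    return np.maximum(out, 0.0)            # sigma assumed to be ReLU

rng = np.random.default_rng(0)
n, m, d_out = 4, 1, 2
triples = [(0, 0, 1), (1, 0, 2)]           # node 3 takes part in no fact
Z0 = np.eye(n)                             # 1-hot initial embeddings
W = rng.normal(size=(2 * m + 1, n, d_out))
Z1 = rgcn_layer(Z0, triples, n, m, W)
```

With 1-hot inputs, an isolated node such as node 3 only receives its own self-loop message, while node 0 additionally receives a message from node 1 through the inverse relation.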

In (Wu et al, 2020b), a TKG is formulated as a DTDG consisting of a sequence
of snapshots $[G^{(1)}, G^{(2)}, \dots, G^{(\tau)}]$ of multi-relational graphs. Each $G^{(t)}$ contains the
same set of entities $V$ and relations $R$ (corresponding to all the entities and relations
in the TKG) and contains the triples $(v_i, r_j, v_k, t)$ from the TKG that occurred at time
$t$. Then, RGCN-BiGRU and RGCN-Transformer models are developed (see
Section 15.4.2) that operate on the DTDG formulation of the TKG where the RGCN
model provides the node embeddings at every timestamp and the BiGRU and
Transformer models aggregate the temporal information. Note that in each $G^{(t)}$ there may
be several nodes with no incoming and outgoing edges (and also no features since
TKGs typically do not have node features). RGCN does not learn a representation
for these nodes as there exists no information about them in $G^{(t)}$. To handle this,
special BiGRU and Transformer models are developed in (Wu et al, 2020b) that
handle missing values.

The RGCN-BiGRU and RGCN-Transformer models provide node embeddings
$H^{(t)}$ at any timestamp $t$. To answer a query such as $(v_i, r_j, ?, t)$, one can compute the
plausibility score of $(v_i, r_j, v_k, t)$ for every $v_k \in V$ and select the entity that achieves
the highest score. A common approach to find the score for an entity $v_k$ for the above
query is to use the TransE decoder (Bordes et al, 2013) according to which the score
is $-\| H^{(t)}_i + R_j - H^{(t)}_k \|$ where $H^{(t)}_i$ and $H^{(t)}_k$ correspond to the node embeddings
for $v_i$ and $v_k$ at time $t$ (provided by the RGCN) and $R$ is a matrix with learnable
parameters which has $m = |R|$ rows each corresponding to an embedding for a relation.
TransE and its extensions are known to make unrealistic assumptions about
the types and properties of the relations (Kazemi and Poole, 2018), so, alternatively,
one may use other decoders that have been developed within the knowledge graph
embedding community (e.g., the models in (Kazemi and Poole, 2018; Trouillon
et al, 2016)).
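The TransE-style scoring of candidate answers can be sketched as follows. The embeddings here are hand-picked toy values rather than outputs of a trained model, the Euclidean norm is assumed, and the function name `transe_scores` is hypothetical.

```python
import numpy as np

def transe_scores(H_t, R, i, j):
    """Score every candidate v_k for the query (v_i, r_j, ?, t).

    H_t holds the node embeddings at time t (one row per entity) and R holds
    one embedding per relation; the score is -||H_t[i] + R[j] - H_t[k]||.
    """
    return -np.linalg.norm(H_t[i] + R[j] - H_t, axis=1)

H_t = np.array([[0.0, 0.0], [1.0, 1.0], [1.0, 2.0], [5.0, 5.0]])
R = np.array([[1.0, 2.0]])

scores = transe_scores(H_t, R, i=0, j=0)
answer = int(np.argmax(scores))   # entity whose embedding is closest to H_t[0] + R[0]
```

In this toy setup the translated subject embedding coincides with the embedding of entity 2, so that entity receives the highest possible score of zero.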

When the timestamps in the TKG are discrete and there are not many of them,
one can use a similar approach as above to answer queries of the form $(v_i, r_j, v_k, ?)$
through finding the score for every $t$ in the set of discrete timestamps and selecting
the one that achieves the highest score (see, e.g., (Leblay and Chekol, 2018)). Time
prediction for TKGs has also been studied in an extrapolation setting where the goal
is to predict when an event is going to happen in the future. This has been mainly
done using temporal point processes as decoders (see, e.g., (Trivedi et al, 2017,
2019)).


15.6 Summary 


Graph-based techniques are emerging as leading approaches in the industry for ap- 
plication domains with relational information. Among these techniques, graph neu- 
ral networks (GNNs) are currently among the top-performing approaches. While 
GNNs and other graph-based techniques were initially developed mainly for static 
graphs, extending these approaches to dynamic graphs has been the subject of sev-
eral recent studies and has found success in several important areas. In this chapter, 
we reviewed the techniques for applying GNNs to dynamic graphs. We also re- 
viewed some of the applications of dynamic GNNs in different domains including 
computer vision, traffic forecasting, and knowledge graphs. 


Editor’s Notes: In the universe, the only thing that does not change is “change”
itself, and networks are no exception. Hence, extending techniques for simple,
static networks to dynamic ones is an inevitable trend as this domain
progresses. While the body of research on dynamic networks has grown
quickly in recent years, much more effort is needed to make substantial
progress on key issues such as scalability and validity, discussed
in Chapter 5 and other chapters. Extensions of the techniques in Chapters
9-18 are also needed. Many real-world applications, such as recommender
systems (Chapter 19) and urban intelligence (Chapter 27), fundamentally
require considering dynamic networks, so they could also benefit from
technical advances toward dynamic networks.
Abstract Heterogeneous graphs (HGs), also called heterogeneous information networks
(HINs), have become ubiquitous in real-world scenarios. Recently, applying
graph neural networks (GNNs) to heterogeneous graphs, known as heterogeneous
graph neural networks (HGNNs), which aim to learn embeddings in a low-dimensional
space while preserving heterogeneous structure and semantics for downstream tasks,
has drawn considerable attention. This chapter will first give a brief review of the
recent development of HG embedding, then introduce typical methods from the
perspective of shallow and deep models, especially HGNNs. Finally, it will point
out future research directions for HGNNs.


16.1 Introduction to HGNNs 


Heterogeneous graphs (HGs) (Sun and Han, 2013), which are composed of different types
of entities and relations, also known as heterogeneous information networks (HINs),
are ubiquitous in real-world scenarios, ranging from bibliographic networks and social
networks to recommender systems. For example, as shown in Fig. 16.1 (a), a bibliographic
network can be represented by a HG, which consists of four types of entities
(author, paper, venue, and term) and three types of relations (author-write-paper,
paper-contain-term and conference-publish-paper); and these basic relations can be
further derived for more complex semantics (e.g., author-write-paper-contain-term).
It has been well recognized that HG is a powerful model that embraces rich semantic
and structural information. Therefore, research on HGs has been experiencing
tremendous growth in data mining and machine learning, with many successful
applications such as recommendation (Shi et al, 2018a; Hu et al, 2018a), text
analysis (Linmei et al, 2019; Hu et al, 2020a), and cybersecurity (Hu et al, 2019b;
Hou et al, 2017).

Due to the ubiquity of HGs, how to learn embeddings of HGs is a key research
problem in various graph analysis applications, e.g., node/graph classification
(Dong et al, 2017; Fu et al, 2017), and node clustering (Li et al, 2019c). Traditionally,
matrix factorization methods (Newman, 2006b) generate latent features in
HGs. However, decomposing a large-scale matrix is usually computationally very
expensive, and these methods also suffer from statistical performance drawbacks (Shi
et al, 2016; Cui et al, 2018). To address this challenge, heterogeneous graph embedding,
which aims to learn a function that maps the input space into a lower-dimensional space
while preserving heterogeneous structure and semantics, has drawn considerable
attention in recent years.

Although there have been ample studies of embedding technology on homogeneous
graphs (Cui et al, 2018), which consist of only one type of nodes and edges,
these techniques are not directly applicable to HGs due to heterogeneity. Specifically,
(1) the structure in HGs is usually semantic dependent, e.g., meta-path structure
(Dong et al, 2017) can be very different when considering different types of
relations; (2) different types of nodes and edges have different attributes located in
different feature spaces; (3) HGs are usually application dependent, which may need
sufficient domain knowledge for meta-path/meta-graph selection.

To tackle the above issues, various HG embedding methods have been proposed
(Chen et al, 2018b; Hu et al, 2019a; Dong et al, 2017; Fu et al, 2017; Wang et al,
2019m; Shi et al, 2018a; Wang et al, 2020n). From the technical perspective, we
divide the widely used models in HG embedding into two categories: shallow models
and deep models. In summary, shallow models initialize the node embeddings
randomly, then learn the node embeddings through optimizing some well-designed
objective functions to preserve heterogeneous structures and semantics. Deep models
aim to use deep neural networks (DNNs) to learn embeddings from node attributes
or interactions; among them, heterogeneous graph neural networks (HGNNs) stand out and
will be the focus of this chapter. HG embedding techniques have also been successfully
deployed in real-world applications including recommender
systems (Shi et al, 2018a; Hu et al, 2018a; Wang et al, 2020n), malware detection
systems (Hou et al, 2017; Fan et al, 2018; Ye et al, 2019a), and healthcare systems
(Cao et al, 2020; Hosseini et al, 2018).

The remainder of this chapter is organized as follows. In Sect. 16.1, we first
introduce basic concepts in HGs, then discuss unique challenges of HG embedding
due to the heterogeneity and give a brief review of the recent development on HG
embedding. In Sect. 16.2 and 16.3, we categorize and introduce HG embedding in
detail according to the shallow and deep models. In Sect. 16.4, we further review
pros and cons of the models introduced above. Finally, Sect. 16.5 forecasts the future
research directions for HGNNs.
Fig. 16.1: An illustrative example of a heterogeneous graph (Wang et al, 2020l).
(a) A bibliographic graph including four types of entities (i.e., author, paper, venue 
and term) and three types of relations (i.e., publish, contain and write). (b) Network 
schema of the bibliographic graph. (c) Two meta-paths (i.e., author-paper-author 
and paper-term-paper). (d) A meta-graph used in the bibliographic graph. 


16.1.1 Basic Concepts of Heterogeneous Graphs 


In this section, we will first formally introduce basic concepts in HGs and illustrate 
the symbols used throughout this chapter. HG is a graph consisting of different types 
of entities (i.e., nodes) and/or different types of relations (i.e., edges), which can be 
defined as follows. 


Definition 16.1. Heterogeneous Graph (or Heterogeneous Information Network)
(Sun and Han, 2013). A HG is defined as a graph $\mathcal{G} = \{\mathcal{V}, \mathcal{E}\}$,
in which $\mathcal{V}$ and $\mathcal{E}$ represent the node set and the edge set,
respectively. Each node $v \in \mathcal{V}$ and each edge $e \in \mathcal{E}$ are
associated with their mapping function $\phi(v) : \mathcal{V} \to \mathcal{A}$ and
$\varphi(e) : \mathcal{E} \to \mathcal{R}$. $\mathcal{A}$ and $\mathcal{R}$ denote the
node type set and edge type set, respectively, where $|\mathcal{A}| + |\mathcal{R}| > 2$.
The network schema for $\mathcal{G}$ is defined as $\mathcal{S} = (\mathcal{A}, \mathcal{R})$,
which can be seen as a meta template of a heterogeneous graph
$\mathcal{G} = \{\mathcal{V}, \mathcal{E}\}$ with the node type mapping function
$\phi(v) : \mathcal{V} \to \mathcal{A}$ and the edge type mapping function
$\varphi(e) : \mathcal{E} \to \mathcal{R}$. The network schema is a graph defined over
node types $\mathcal{A}$, with edges as relation types from $\mathcal{R}$.
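As a minimal sketch of Definition 16.1, the snippet below stores a toy bibliographic HG as a node-type map and a list of typed edges, then derives the network schema from it. The node names, the edge list and the helper `network_schema` are hypothetical and only illustrate the definition; they are not taken from any cited implementation.

```python
# Toy bibliographic HG; node names and edges are hypothetical.
node_type = {  # phi: maps each node to its type in A
    "a1": "author", "a2": "author",
    "p1": "paper", "p2": "paper",
    "v1": "venue", "t1": "term",
}
edges = [  # (source, target, relation); the relation plays the role of varphi(e)
    ("a1", "p1", "write"), ("a2", "p1", "write"), ("a2", "p2", "write"),
    ("v1", "p1", "publish"), ("v1", "p2", "publish"),
    ("p1", "t1", "contain"), ("p2", "t1", "contain"),
]


def network_schema(node_type, edges):
    """Return the node type set A, the edge type set R and the schema edges."""
    node_types = set(node_type.values())
    edge_types = {rel for _, _, rel in edges}
    # One schema edge per distinct (source type, relation, target type) triple.
    schema_edges = {(node_type[u], rel, node_type[v]) for u, v, rel in edges}
    return node_types, edge_types, schema_edges


A, R, schema = network_schema(node_type, edges)
is_heterogeneous = len(A) + len(R) > 2  # the condition |A| + |R| > 2
print(sorted(schema))
```

The schema contains one edge per relation type, so it stays small even when the instance graph grows.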


HG not only provides graph structure of data association, but also portrays 
higher-level semantics. An example of HG is illustrated in Fig. 16.1 (a), which 
consists of four node types (author, paper, venue, and term) and three edge types 
(author-write-paper, paper-contain-term, and conference-publish-paper), and Fig. 
16.1 (b) illustrates the network schema. To formulate semantics of higher-order re- 
lationships among entities, meta-path (Sun et al, 2011) is further proposed whose 
definition is given below. 


Definition 16.2. Meta-path (Sun et al, 2011). A meta-path $p$ is based on network
schema $\mathcal{S}$, which is denoted as
$p = N_1 \xrightarrow{R_1} N_2 \xrightarrow{R_2} \cdots \xrightarrow{R_l} N_{l+1}$
(simplified to $N_1 N_2 \cdots N_{l+1}$) with node types
$N_1, N_2, \cdots, N_{l+1} \in \mathcal{A}$ and edge types
$R_1, R_2, \cdots, R_l \in \mathcal{R}$.
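To make the definition concrete, the following sketch finds the nodes reachable from a start node along a meta-path written in its simplified form (a string of node-type initials such as APA). It assumes a toy undirected graph with at most one relation between any two node types, so the edge types can be left implicit; the graph and the helper `metapath_neighbors` are hypothetical.

```python
from collections import defaultdict

# Hypothetical toy graph: authors (A), papers (P) and one conference (C).
node_type = {"a1": "A", "a2": "A", "a3": "A", "p1": "P", "p2": "P", "c1": "C"}
edges = [("a1", "p1"), ("a2", "p1"), ("a2", "p2"), ("a3", "p2"),
         ("p1", "c1"), ("p2", "c1")]

adj = defaultdict(set)
for u, v in edges:  # treat edges as undirected
    adj[u].add(v)
    adj[v].add(u)


def metapath_neighbors(start, metapath):
    """Nodes reached from `start` by following the node types in `metapath`."""
    if node_type[start] != metapath[0]:
        raise ValueError("start node type must match the first type of the meta-path")
    frontier = {start}
    for next_type in metapath[1:]:
        # Keep only neighbors whose type matches the next step of the meta-path.
        frontier = {v for u in frontier for v in adj[u] if node_type[v] == next_type}
    return frontier - {start}


coauthors = metapath_neighbors("a1", "APA")
same_conference = metapath_neighbors("a1", "APCPA")
print(sorted(coauthors), sorted(same_conference))
```

The two calls return different neighbor sets for the same author, which is the sense in which different meta-paths give different semantic views.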


Note that different meta-paths describe semantic relationships in different views. 
For example, the meta-path APA indicates the co-author relationship and APCPA 
represents the co-conference relation. Both of them can be used to formulate the 
relatedness over authors. Although meta-path can be used to depict the relatedness 
over entities, it fails to capture a more complex relationship, such as motifs (Milo 
et al, 2002). To address this challenge, meta-graph (Huang et al, 2016b) is proposed 
to use a directed acyclic graph of entity and relation types to capture more complex 
relationships between entities, defined as follows. 


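Definition 16.3. Meta-graph (Huang et al, 2016b). A meta-graph $\mathcal{T}$ can be seen
as a directed acyclic graph (DAG) composed of multiple meta-paths with common
nodes. Formally, meta-graph is defined as
$\mathcal{T} = (\mathcal{V}_{\mathcal{T}}, \mathcal{E}_{\mathcal{T}})$, where
$\mathcal{V}_{\mathcal{T}}$ is a set of nodes and $\mathcal{E}_{\mathcal{T}}$ is a set
of edges. For any node $v \in \mathcal{V}_{\mathcal{T}}$, $\phi(v) \in \mathcal{A}$;
for any edge $e \in \mathcal{E}_{\mathcal{T}}$, $\varphi(e) \in \mathcal{R}$.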


An example meta-graph is shown in Fig. 16.1 (d), which can be regarded as 
the combination of meta-path APA and APCPA, reflecting high-order similarity of 
two nodes. Note that a meta-graph can be symmetric or asymmetric (Zhang et al, 
2020g). To learn embeddings of HG, we formalize the problem of heterogeneous 
graph embedding. 


Definition 16.4. Heterogeneous Graph Embedding (Shi et al, 2016). Heterogeneous
graph embedding aims to learn a function $\Phi : \mathcal{V} \to \mathbb{R}^d$ that
embeds the nodes $v \in \mathcal{V}$ in HG into low-dimensional Euclidean space with
$d \ll |\mathcal{V}|$.


16.1.2 Challenges of HG Embedding 


In homogeneous graph embedding, the basic problem is preserving structure and
properties in node embeddings (Cui et al, 2018). Due to the heterogeneity, HG
embedding imposes additional challenges, which are illustrated below.

Complex Structure (the complex HG structure caused by multiple types of
nodes and edges). In a homogeneous graph, the fundamental structure can be
considered as first-order, second-order, and even higher-order structures (Tang et al,
2015b). All these structures are well defined and have good intuition. However, the
structure in HGs changes dramatically depending on the selected relations. Taking
the academic graph in Fig. 16.1 (a) as an example again, the neighbors of a
paper are authors under the “write” relation, whereas under the “contain” relation
the neighbors become terms. Complicating things further, combinations of these
relations, which can be considered as higher-order structures in HGs, result in
different and more complicated structures. Therefore, how to efficiently and
effectively preserve these complex structures is a great challenge in HG embedding,
although efforts have been made towards the meta-path structure (Dong et al,
2017) and meta-graph structure (Zhang et al, 2018b).

Fig. 16.2: Heterogeneous graph embedding tree classification diagram. Shallow
models: decomposition-based methods (HERec, EOE, etc.) and random walk-based
methods (metapath2vec, Spacey, etc.). Deep models: message passing-based methods
(HGNNs; HetGNN, HAN, etc.), encoder-decoder-based methods (DHNE, HNE, etc.) and
adversarial-based methods (HeGAN, MV-ACM, etc.).

Heterogeneous Attributes (the fusion problem caused by the heterogeneity of
attributes). Since nodes and edges in a homogeneous graph have the same type, each
dimension of the node or edge attributes has the same meaning. In this situation,
a node can directly fuse the attributes of its neighbors. However, in HGs, the
attributes of different types of nodes and edges may have different meanings
(Zhang et al, 2019b; Wang et al, 2019m). For example, the attributes of an author
can be research fields, while a paper may use keywords as attributes. Therefore,
how to overcome the heterogeneity of attributes and effectively fuse the attributes
of neighbors poses another challenge in HG embedding.

Application Dependent. HGs are closely tied to real-world applications,
and many practical problems remain unsolved. For example, constructing an
appropriate HG may require sufficient domain knowledge of the application.
Also, meta-paths and/or meta-graphs are widely used to capture the structure of HGs.
However, unlike in a homogeneous graph, where the structure (e.g., the first-order
and second-order structure) is well defined, meta-path selection may also need prior
knowledge. Furthermore, to better serve real-world applications, we usually
need to carefully encode side information (e.g., node attributes) (Wang et al,
2019m; Zhang et al, 2019b) or more advanced domain knowledge (Shi et al, 2018a;
Chen and Sun, 2017) into the HG embedding process.


16.1.3 Brief Overview of Current Development 


Most early works on graph data rely on high-dimensional sparse vectors
for matrix analysis. However, the sparsity of real-world graphs and their growing
scale have created serious challenges for such methods. A more effective way,
known as graph embedding, is to map nodes to a latent space and represent them
with low-dimensional vectors, which can then be applied more flexibly to different
data mining tasks.

Many works have been dedicated to homogeneous graph embedding (Cui
et al, 2018). These works are mainly based on deep models and combined with graph
properties to learn embeddings of nodes or edges. For instance, DeepWalk (Perozzi
et al, 2014) combines random walk and the skip-gram model; LINE (Tang et al, 2015b)
utilizes first-order and second-order similarity to learn distinguishable node
embeddings for large-scale graphs; SDNE (Wang et al, 2016) uses deep auto-encoders to
extract non-linear characteristics of graph structure. In addition to structural
information, many methods further use the content of nodes or other auxiliary
information (such as text, images, and tags) to learn more accurate and meaningful
node embeddings. Some survey papers comprehensively summarize the work in this area
(Cui et al, 2018; Hamilton et al, 2017c).

Due to the heterogeneity, embedding techniques for homogeneous graphs cannot
be directly applied to HGs. Therefore, researchers have begun to explore
HG embedding methods, which emerged only in recent years but are developing rapidly.
From the technical perspective, we summarize the widely used techniques (or models)
in HG embedding, which can be generally divided into two categories: shallow
models and deep models, as shown in Fig. 16.2. Specifically, shallow models mainly
rely on meta-paths to simplify the complex structure of HGs, and can be classified
into decomposition-based and random walk-based methods. Decomposition-based
techniques (Chen et al, 2018b; Xu et al, 2017b; Shi et al, 2018b,c; Matsuno and
Murata, 2018; Tang et al, 2015a; Gui et al, 2016) decompose the complex
heterogeneous structure into several simpler homogeneous structures, while random
walk-based methods (Dong et al, 2017; Hussein et al, 2018) utilize meta-path-guided
random walks to preserve specific first-order and high-order structures. Deep
models, which aim to take full advantage of heterogeneous structures and attributes,
fall into three groups: message passing-based (HGNNs), encoder-decoder-based and
adversarial-based methods. The message passing mechanism, i.e., the core idea of
graph neural networks (GNNs), seamlessly integrates structure and attribute
information. HGNNs inherit the message passing mechanism and design suitable
aggregation functions to capture the rich semantics in HGs (Wang et al, 2019m; Fu
et al, 2020; Hong et al, 2020b; Zhang et al, 2019b; Cen et al, 2019; Zhao et al,
2020b; Zhu et al, 2019d; Schlichtkrull et al, 2018). The remaining
encoder-decoder-based (Tu et al, 2018; Chang et al, 2015; Zhang et al, 2019c; Chen
and Sun, 2017) and adversarial-based (Hu et al, 2018a; Zhao et al, 2020c) techniques
employ an encoder-decoder framework or adversarial learning to preserve the complex
attribute and structural information of HGs. In the following sections, we will
introduce representative works of these subcategories in detail and compare their
pros and cons.


16.2 Shallow Models 


Early HG embedding methods focus on employing shallow models. They first
initialize node embeddings randomly, then learn node embeddings through optimizing
some well-designed objective functions. We divide shallow models into two
categories: decomposition-based and random walk-based.

Fig. 16.3: An illustrative example of the proposed meta-path-guided random walk
in HERec (Shi et al, 2018a). HERec first performs random walks guided by some
selected meta-paths, then filters out the nodes in each sequence that are not of
the user type or item type.


16.2.1 Decomposition-based Methods 


To cope with the challenges brought by heterogeneity, decomposition-based
techniques (Chen et al, 2018b; Xu et al, 2017b; Shi et al, 2018b,c; Matsuno and Murata,
2018; Tang et al, 2015a; Gui et al, 2016) decompose the HG into several simpler
sub-graphs, preserve the proximity of nodes in each sub-graph, and finally merge the
information, following a divide-and-conquer strategy.

Specifically, HERec (Shi et al, 2018a) aims to learn embeddings of users and 
items under different meta-paths and fuses them for recommendation. It first finds 
the co-occurrence of users and items based on the meta-path-guided random walks 
on user-item HG, as shown in Fig. 16.3. Then it uses node2vec (Grover and 
Leskovec, 2016) to learn preliminary embeddings from the co-occurrence sequences 
of users and items. Because embeddings under different meta-paths contain differ- 
ent semantic information, for better recommendation performance, HERec designs 
a fusion function to unify the multiple embeddings: 


$$g(h_u^p) = \frac{1}{|P|} \sum_{p=1}^{|P|} \left( W^p h_u^p + b^p \right), \tag{16.1}$$

where $h_u^p$ is the embedding of user node $u$ in meta-path $p$, and $P$ denotes the
set of meta-paths. The fusion of item embeddings is similar to that of users. Finally,
a prediction layer is used to predict the items that users prefer. HERec optimizes
the graph embedding and recommendation objective jointly.
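The linear fusion in Eq. (16.1) can be sketched in a few lines. The snippet below is an illustrative, dependency-free version that averages the transformed meta-path-specific embeddings of one user; the function names and the toy weights are assumptions made for the example, not values from HERec.

```python
def matvec(W, h):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, h)) for row in W]


def fuse(embeddings, weights, biases):
    """Average W^p h^p + b^p over all meta-paths p, as in Eq. (16.1)."""
    num_paths = len(embeddings)
    fused = [0.0] * len(biases[0])
    for h, W, b in zip(embeddings, weights, biases):
        transformed = [wh + bk for wh, bk in zip(matvec(W, h), b)]
        fused = [f + t / num_paths for f, t in zip(fused, transformed)]
    return fused


# Two hypothetical meta-path-specific embeddings of the same user.
h_paths = [[1.0, 2.0], [3.0, 4.0]]
W_paths = [[[1.0, 0.0], [0.0, 1.0]], [[2.0, 0.0], [0.0, 2.0]]]
b_paths = [[0.0, 0.0], [1.0, 1.0]]
fused = fuse(h_paths, W_paths, b_paths)
print(fused)
```

In a trained model the matrices and biases would be learned jointly with the recommendation objective rather than fixed by hand.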

As another example, EOE (Xu et al, 2017b) is proposed to learn embeddings for
coupled HGs, which consist of two different but related subgraphs. It divides the
edges in the HG into intra-graph edges and inter-graph edges. An intra-graph edge
connects two nodes of the same type, and an inter-graph edge connects two nodes of
different types. To capture the heterogeneity in inter-graph edges, EOE uses the
relation-specific matrix $M_r$ to calculate the similarity between two nodes, which
can be formulated as:

$$S_r(v_i, v_j) = \frac{1}{1 + \exp\{-h_i^{\top} M_r h_j\}}. \tag{16.2}$$


Similarly, PME (Chen et al, 2018b) decomposes HG into some bipartite graphs 
according to the types of edges and projects each bipartite graph into a relation- 
specific semantic space. PTE (Tang et al, 2015a) divides the documents into word- 
word graph, word-document graph and word-label graph. Then it uses LINE (Tang 
et al, 2015b) to learn the shared node embeddings for each sub-graph. HEBE (Gui 
et al, 2016) samples a series of subgraphs from a HG and preserves the proximity 
between the center node and its subgraph. 

The above-mentioned two-step framework of decomposition and fusion, as a
transitional product between homogeneous networks and HGs, was often used in early
attempts at HG embedding. Later, researchers gradually realized that extracting
homogeneous graphs from HGs irreversibly loses information carried by heterogeneous
neighbors, and began to explore HG embedding methods that truly adapt
to heterogeneous structure.


16.2.2 Random Walk-based Methods 


Random walk, which generates node sequences in a graph, is often used to
describe the reachability between nodes. Therefore, it is widely used in graph
representation learning to sample neighbor relationships of nodes and capture local
structure in the graph (Grover and Leskovec, 2016). In homogeneous graphs, there is
a single node type and a random walk can follow any path. In HGs, by contrast, due
to the type constraints of nodes and edges, a meta-path-guided random walk is usually
adopted, so that the generated node sequence contains not only structural
information but also semantic information. By preserving the node sequence
structure, node embeddings can preserve both first-order and high-order proximity
(Dong et al, 2017). A representative work is metapath2vec (Dong et al, 2017), which
uses meta-path-guided random walks to capture semantic information between two nodes,
e.g., the co-author relationship in an academic graph as shown in Fig. 16.4.
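A meta-path-guided random walk can be sketched as follows. This is a simplified illustration, assuming a symmetric meta-path written as a string of node-type initials and uniform choice among type-compatible neighbors; the toy graph and the helper `metapath_walk` are hypothetical.

```python
import random
from collections import defaultdict

# Hypothetical author-paper graph.
node_type = {"a1": "A", "a2": "A", "a3": "A", "p1": "P", "p2": "P"}
edges = [("a1", "p1"), ("a2", "p1"), ("a2", "p2"), ("a3", "p2")]

adj = defaultdict(set)
for u, v in edges:
    adj[u].add(v)
    adj[v].add(u)


def metapath_walk(start, metapath, walk_length, rng):
    """Random walk whose node types repeat the pattern of a symmetric meta-path."""
    if metapath[0] != metapath[-1] or node_type[start] != metapath[0]:
        raise ValueError("need a symmetric meta-path that starts at the start node's type")
    pattern = metapath[:-1]  # "APA" -> "AP", repeated cyclically
    walk = [start]
    while len(walk) < walk_length:
        next_type = pattern[len(walk) % len(pattern)]
        candidates = [v for v in sorted(adj[walk[-1]]) if node_type[v] == next_type]
        if not candidates:  # dead end: no neighbor of the required type
            break
        walk.append(rng.choice(candidates))
    return walk


rng = random.Random(0)
walk = metapath_walk("a1", "APA", walk_length=7, rng=rng)
print(walk)
```

Nodes that co-occur within a window of such a walk then serve as (center, context) pairs for a skip-gram objective.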

Metapath2vec (Dong et al, 2017) uses meta-path-guided random walks to
generate heterogeneous node sequences with rich semantics, and then designs a
heterogeneous skip-gram technique to preserve the proximity between node $v$ and its
context nodes, i.e., neighbors in the random walk sequences:

$$\arg\max_{\theta} \sum_{v \in \mathcal{V}} \sum_{t \in \mathcal{A}} \sum_{c_t \in C_t(v)} \log p(c_t \mid v; \theta), \tag{16.3}$$

where $C_t(v)$ represents the context nodes of node $v$ with type $t$, and
$p(c_t \mid v; \theta)$ denotes the heterogeneous similarity function on node $v$ and
its context neighbors $c_t$:

$$p(c_t \mid v; \theta) = \frac{e^{h_{c_t} \cdot h_v}}{\sum_{\tilde{v} \in \mathcal{V}} e^{h_{\tilde{v}} \cdot h_v}}. \tag{16.4}$$

From the diagram shown in Fig. 16.4, Eq. (16.4) needs to calculate the similarity
between the center node and its neighbors, and its denominator sums over all nodes
in $\mathcal{V}$, which is expensive. The negative sampling strategy introduced by
Mikolov et al (2013b) is therefore used to reduce the computation. Hence, the
log-probability in Eq. (16.3) can be approximated as:

$$\log \sigma(h_{c_t} \cdot h_v) + \sum_{q=1}^{Q} \mathbb{E}_{\tilde{v}^q \sim P(\tilde{v})} \left[ \log \sigma(-h_{\tilde{v}^q} \cdot h_v) \right], \tag{16.5}$$

where $\sigma(\cdot)$ is the sigmoid function, and $P(\tilde{v})$ is the distribution
from which the negative node $\tilde{v}^q$ is sampled $Q$ times. Through negative
sampling, the time complexity is greatly reduced. However, when choosing the negative
samples, metapath2vec does not consider the types of nodes, i.e., nodes of different
types are drawn from the same distribution $P(\tilde{v})$. Dong et al (2017) therefore
further design metapath2vec++, which samples negative nodes of the same type $t$ as
the context node $c_t$, i.e., $\tilde{v}_t^q \sim P(\tilde{v}_t)$. The formulation can
be rewritten as:

$$\log \sigma(h_{c_t} \cdot h_v) + \sum_{q=1}^{Q} \mathbb{E}_{\tilde{v}_t^q \sim P(\tilde{v}_t)} \left[ \log \sigma(-h_{\tilde{v}_t^q} \cdot h_v) \right]. \tag{16.6}$$

After optimizing this objective, metapath2vec and metapath2vec++ can
capture both structural information and semantic information effectively and
efficiently.
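The difference between Eqs. (16.5) and (16.6) lies only in where the negative nodes come from. The sketch below evaluates a sampled version of the objective for one (center, context) pair, replacing the expectation by Q drawn negatives. It assumes uniform sampling for simplicity (word2vec-style implementations typically use a smoothed frequency distribution instead), and the embeddings and helper names are hypothetical.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def sample_negatives(node_type, Q, rng, only_type=None):
    """Draw Q negatives uniformly; restrict to one node type if `only_type` is set."""
    pool = sorted(n for n, t in node_type.items() if only_type is None or t == only_type)
    return [rng.choice(pool) for _ in range(Q)]


def objective(emb, v, c, negatives):
    """Sampled form of Eq. (16.5)/(16.6) for center node v and context node c."""
    value = log_sigmoid(dot(emb[c], emb[v]))
    for n in negatives:
        value += log_sigmoid(-dot(emb[n], emb[v]))
    return value


node_type = {"a1": "A", "a2": "A", "p1": "P", "p2": "P"}
emb = {"a1": [0.5, 0.1], "a2": [0.4, 0.2], "p1": [-0.3, 0.6], "p2": [0.1, -0.2]}
rng = random.Random(0)

neg_plain = sample_negatives(node_type, Q=3, rng=rng)                 # any node type
neg_typed = sample_negatives(node_type, Q=3, rng=rng, only_type="P")  # type of context p1
obj_plain = objective(emb, "a1", "p1", neg_plain)
obj_typed = objective(emb, "a1", "p1", neg_typed)
```

This simplified sampler may occasionally draw the true context node as a negative; a practical implementation would usually exclude it.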

Based on metapath2vec, a series of variants have been proposed. Spacey (He
et al, 2019) designs a heterogeneous spacey random walk to unify different
meta-paths with a second-order hyper-matrix to control transition probability among
different node types. JUST (Hussein et al, 2018) proposes a random walk method with
Jump and Stay strategies, which can flexibly choose to change or maintain the type
of the next node in the random walk without meta-path. BHIN2vec (Lee et al, 2019e)
proposes an extended skip-gram technique to balance the various types of relations.
It treats heterogeneous graph embedding as multiple relation-based tasks, and
balances the influence of different relations on node embeddings by adjusting the
training ratio of different tasks. HHNE (Wang et al, 2019n) conducts meta-path-guided
random walk in hyperbolic space (Helgason, 1979), where the similarity between
nodes can be measured using hyperbolic distance. In this way, some properties of
HGs, e.g., hierarchical and power-law structure, can be naturally reflected in learned
node embeddings.
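As an illustration of measuring similarity by hyperbolic distance, the snippet below computes the distance in the Poincaré ball, one common model of hyperbolic space. Choosing this particular model is an assumption made for the example; the sketch is not taken from the HHNE implementation.

```python
import math


def poincare_distance(x, y):
    """Distance between two points strictly inside the unit ball (Poincare model)."""
    sq_norm = lambda v: sum(c * c for c in v)
    if sq_norm(x) >= 1.0 or sq_norm(y) >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    diff = sq_norm([a - b for a, b in zip(x, y)])
    return math.acosh(1.0 + 2.0 * diff / ((1.0 - sq_norm(x)) * (1.0 - sq_norm(y))))


origin = [0.0, 0.0]
near = [0.5, 0.0]
far = [0.9, 0.0]
d_near = poincare_distance(origin, near)
d_far = poincare_distance(origin, far)
print(d_near, d_far)
```

Distances grow rapidly towards the boundary of the ball, which is why tree-like, hierarchical structure tends to embed with low distortion in such spaces.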


16.3 Deep Models 


In recent years, deep neural networks (DNNs) have achieved great success in the
fields of computer vision and natural language processing. Some works have also
begun to use deep models to learn embeddings from node attributes or interactions
among nodes in HGs. Compared with shallow models, deep models can better capture
non-linear relationships. They can be roughly divided into three categories:
message passing-based, encoder-decoder-based and adversarial-based.


16.3.1 Message Passing-based Methods (HGNNs) 


Graph neural networks (GNNs) have emerged recently. Their core idea is the message
passing mechanism, which aggregates neighborhood information and transmits it as
messages to neighbor nodes. Whereas GNNs on homogeneous graphs can directly fuse the
attributes of neighbors to update node embeddings, HGNNs face different types of
nodes and edges, so they need to overcome the heterogeneity of attributes and design
effective fusion methods to utilize neighborhood information. Therefore, the key
component is a suitable aggregation function that can capture the semantic and
structural information of HGs (Wang et al, 2019m; Fu et al, 2020; Hong et al, 2020b;
Zhang et al, 2019b; Cen et al, 2019; Zhao et al, 2020b; Zhu et al, 2019d;
Schlichtkrull et al, 2018).

Unsupervised HGNNs. Unsupervised HGNNs aim to learn node embeddings
with good generalization. To this end, they typically utilize interactions among
different types of attributes to capture the potential commonalities. HetGNN (Zhang
et al, 2019b) is the representative work of unsupervised HGNNs. It consists of three
parts: content aggregation, neighbor aggregation, and type aggregation. Content
aggregation is designed to learn fused embeddings from different node contents, such
as images, text, or attributes:

$$f_1(v) = \frac{\sum_{i \in C_v} \left[ \overrightarrow{\mathrm{LSTM}}\{\mathcal{FC}(h_i)\} \oplus \overleftarrow{\mathrm{LSTM}}\{\mathcal{FC}(h_i)\} \right]}{|C_v|}, \tag{16.7}$$

where $C_v$ is the set of attribute types of node $v$ and $h_i$ is the $i$-th attribute
of node $v$. A bi-directional Long Short-Term Memory (Bi-LSTM) (Huang et al, 2015) is
used to fuse the embeddings learned by the multiple attribute encoders $\mathcal{FC}$.
Neighbor aggregation aims to aggregate the nodes with the same type by using a
Bi-LSTM to capture the position information:

$$f_2^t(v) = \frac{\sum_{v' \in N_t(v)} \left[ \overrightarrow{\mathrm{LSTM}}\{f_1(v')\} \oplus \overleftarrow{\mathrm{LSTM}}\{f_1(v')\} \right]}{|N_t(v)|}, \tag{16.8}$$

where $N_t(v)$ is the set of first-order neighbors of node $v$ with type $t$. Type
aggregation uses an attention mechanism to mix the embeddings of different types and
produces the final node embeddings:

$$h_v = \alpha^{v,v} f_1(v) + \sum_{t \in O_v} \alpha^{v,t} f_2^t(v), \tag{16.9}$$

where $h_v$ is the final embedding of node $v$, and $O_v$ denotes the set of node
types. Finally, a heterogeneous skip-gram loss is used as the unsupervised graph
context loss to update node embeddings. Through these three aggregation methods,
HetGNN can preserve the heterogeneity of both graph structures and node attributes.
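The type aggregation step of Eq. (16.9) can be sketched without the Bi-LSTM encoders by treating the content embedding and the per-type neighbor embeddings as given vectors. The scoring function below (a LeakyReLU of a dot product with a vector `u` over the concatenated embeddings, followed by a softmax) is an assumption chosen for the illustration, and all names and numbers are hypothetical.

```python
import math


def leaky_relu(x, slope=0.01):
    return x if x > 0 else slope * x


def type_aggregate(f1_v, f2_by_type, u):
    """Mix f1(v) and the per-type neighbor embeddings with attention weights."""
    candidates = [f1_v] + [f2_by_type[t] for t in sorted(f2_by_type)]
    # Score each candidate against the node's own content embedding (list concat).
    scores = [leaky_relu(sum(ui * xi for ui, xi in zip(u, f + f1_v))) for f in candidates]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    alphas = [e / sum(exps) for e in exps]
    h_v = [sum(a * f[k] for a, f in zip(alphas, candidates)) for k in range(len(f1_v))]
    return h_v, alphas


f1_v = [1.0, 0.0]                                          # content embedding of v
f2_by_type = {"author": [0.0, 1.0], "paper": [1.0, 1.0]}   # neighbor embeddings per type
u = [0.0, 0.0, 0.0, 0.0]                                   # zero vector gives uniform attention
h_v, alphas = type_aggregate(f1_v, f2_by_type, u)
print(h_v, alphas)
```

With a zero scoring vector the result is a plain average; a learned vector would shift weight towards the more informative node types.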

Other unsupervised methods capture either the heterogeneity of node attributes or
the heterogeneity of graph structures. HNE (Chang et al, 2015) is proposed to learn
embeddings for cross-modal data in HGs, but it ignores the various types of edges.
SHNE (Zhang et al, 2019c) focuses on capturing semantic information of nodes by
designing a deep semantic encoder with gated recurrent units (GRU) (Chung et al,
2014). Although it uses heterogeneous skip-gram to preserve the heterogeneity of the
graph, SHNE is designed specifically for text data. Cen et al (2019) propose GATNE,
which aims to learn node embeddings in a multiplex graph, i.e., a heterogeneous
graph with different types of edges. Compared with HetGNN, GATNE pays
more attention to distinguishing different edge relationships between node pairs.

Semi-supervised HGNNs. Different from unsupervised HGNNs, semi-supervised
HGNNs aim to learn task-specific node embeddings in an end-to-end manner. For
this reason, they prefer to use the attention mechanism to capture the structural
and attribute information most relevant to the task. Wang et al (2019m) propose the
heterogeneous graph attention network (HAN), which uses a hierarchical attention
mechanism to capture both node and semantic importance. The architecture of HAN
is shown in Fig. 16.5.

Fig. 16.5: The architecture of HAN (Wang et al, 2019m). The whole model can
be divided into three parts: Node-Level Attention aims to learn the importance of
neighbors’ features. Semantic-Level Attention aims to learn the importance of
different meta-paths. Prediction layer utilizes the labeled nodes to update node
embeddings.

It consists of three parts: node-level attention, semantic-level attention, and
prediction. Node-level attention aims to utilize the self-attention mechanism (Vaswani
et al, 2017) to learn the importance of neighbors in a certain meta-path:

$$\alpha_{ij}^{m} = \frac{\exp\left(\sigma\left(a_m^{\top} \cdot [h_i \| h_j]\right)\right)}{\sum_{k \in \mathcal{N}_i^m} \exp\left(\sigma\left(a_m^{\top} \cdot [h_i \| h_k]\right)\right)}, \tag{16.10}$$

where $\mathcal{N}_i^m$ is the set of neighbors of node $v_i$ in meta-path $m$,
$\alpha_{ij}^m$ is the weight of node $v_j$ to node $v_i$ under meta-path $m$, $a_m$ is
the node-level attention vector of meta-path $m$, $\|$ denotes concatenation, and
$\sigma$ is an activation function. The node-level aggregation is defined as:

$$h_i^m = \sigma\Big( \sum_{j \in \mathcal{N}_i^m} \alpha_{ij}^m \cdot h_j \Big), \tag{16.11}$$

where $h_i^m$ denotes the learned embedding of node $i$ based on meta-path $m$. Because
different meta-paths capture different semantic information of HG, a semantic-level
attention mechanism is designed to calculate the importance of meta-paths. Given
a set of meta-paths $\{m_0, m_1, \cdots, m_P\}$, after feeding node features into
node-level attention, HAN obtains one group of semantic-specific node embeddings per
meta-path, $\{H_{m_0}, H_{m_1}, \cdots, H_{m_P}\}$. To effectively aggregate different
semantic embeddings, HAN designs a semantic-level attention mechanism:

$$w_{m_i} = \frac{1}{|\mathcal{V}|} \sum_{v \in \mathcal{V}} q^{\top} \cdot \tanh\left(W \cdot h_v^{m_i} + b\right), \tag{16.12}$$

where $h_v^{m_i}$ is the embedding of node $v$ under meta-path $m_i$ obtained from
Eq. (16.11), $W \in \mathbb{R}^{d' \times d}$ and $b \in \mathbb{R}^{d' \times 1}$ denote
the weight matrix and bias of the MLP, respectively, and
$q \in \mathbb{R}^{d' \times 1}$ is the semantic-level attention vector. In order to
prevent the node embeddings from being too large, HAN uses the softmax function to
normalize $w_{m_i}$. Hence, the semantic-level aggregation is defined as:

$$H = \sum_{i=0}^{P} \beta_{m_i} \cdot H_{m_i}, \tag{16.13}$$

where $\beta_{m_i}$ denotes the normalized $w_{m_i}$, which represents the semantic
importance, and $H \in \mathbb{R}^{|\mathcal{V}| \times d}$ denotes the final node
embeddings. Finally, a task-specific layer is used to fine-tune node embeddings with
a small number of labels, and the embeddings $H$ can be used in downstream tasks, such
as node clustering and link prediction. HAN is the first to extend GNNs to the
heterogeneous graph and design a hierarchical attention mechanism, which can capture
both structural and semantic information.
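The two attention levels can be sketched in plain Python. The snippet below is a simplified illustration under stated assumptions: LeakyReLU stands in for the unspecified activation $\sigma$, a single attention head is used, and node features are taken as already projected into a common space. All helper names and toy values are hypothetical and do not reproduce the HAN implementation.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def matvec(W, x):
    return [dot(row, x) for row in W]


def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x


def node_level(h, i, neighbors, a_m):
    """Eqs. (16.10)-(16.11): attention over the meta-path neighbors of node i."""
    scores = [leaky_relu(dot(a_m, h[i] + h[j])) for j in neighbors]  # list concat
    alpha = softmax(scores)
    mixed = [sum(a * h[j][k] for a, j in zip(alpha, neighbors)) for k in range(len(h[i]))]
    return [leaky_relu(x) for x in mixed]


def semantic_level(H_by_path, W, b, q):
    """Eqs. (16.12)-(16.13): score each meta-path, normalize, then mix."""
    paths = sorted(H_by_path)
    w = []
    for m in paths:
        per_node = [dot(q, [math.tanh(z + bk) for z, bk in zip(matvec(W, h_v), b)])
                    for h_v in H_by_path[m]]
        w.append(sum(per_node) / len(per_node))
    beta = softmax(w)
    n_nodes, dim = len(H_by_path[paths[0]]), len(H_by_path[paths[0]][0])
    H = [[sum(bm * H_by_path[m][v][k] for bm, m in zip(beta, paths)) for k in range(dim)]
         for v in range(n_nodes)]
    return H, dict(zip(paths, beta))


h = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0]}
h0_apa = node_level(h, 0, neighbors=[1, 2], a_m=[0.0, 0.0, 0.0, 0.0])

H_by_path = {"APA": [[1.0, 0.0]], "APCPA": [[0.0, 1.0]]}
W = [[1.0, 0.0], [0.0, 1.0]]
H, beta = semantic_level(H_by_path, W, b=[0.0, 0.0], q=[0.0, 0.0])
```

With zero attention vectors both levels reduce to uniform averaging, which makes the toy output easy to check by hand; learned vectors would make the weights depend on the features.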

Subsequently, a series of attention-based HGNNs was proposed (Fu et al, 2020;
Hong et al, 2020b; Hu et al, 2020e). MAGNN (Fu et al, 2020) designs intra-metapath
aggregation and inter-metapath aggregation. The former samples some meta-path
instances surrounding the target node and uses an attention layer to learn the
importance of different instances, and the latter aims to learn the importance of
different meta-paths. HetSANN (Hong et al, 2020b) and HGT (Hu et al, 2020e) treat one
type of node as query to calculate the importance of other types of nodes around it,
through which the method can not only capture interactions among different types
of nodes, but also assign different weights to neighbors during aggregation.

In addition, there are some HGNNs that focus on other issues. NSHE (Zhao et al, 
2020b) proposes to incorporate network schema, instead of meta-path, in aggregat- 
ing neighborhood information. GTN (Yun et al, 2019) aims to automatically identify 
the useful meta-paths and high-order edges in the process of learning node embed- 
dings. RSHN (Zhu et al, 2019d) uses both original node graph and coarsened line 
graph to design a relation-structure aware HGNN. RGCN (Schlichtkrull et al, 2018) 
uses multiple weight matrices to project the node embeddings into different relation 
spaces, thus capturing the heterogeneity of the graph. 
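The idea of relation-specific weight matrices can be sketched as a single message passing layer. The snippet below is an illustrative, simplified layer: each relation has its own matrix, messages are averaged per relation, a self-connection is added, and ReLU is applied. The normalization, the toy graph and all names are assumptions for the example rather than the reference RGCN code.

```python
def matvec(W, x):
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def relational_layer(h, neighbors_by_rel, W_rel, W_self):
    """One layer: h_i' = ReLU(W_self h_i + sum_r mean_{j in N_i^r} W_r h_j)."""
    out = {}
    for i, h_i in h.items():
        total = matvec(W_self, h_i)
        for rel, nbrs in neighbors_by_rel.get(i, {}).items():
            for j in nbrs:
                msg = matvec(W_rel[rel], h[j])  # project into the space of relation rel
                total = [t + m / len(nbrs) for t, m in zip(total, msg)]
        out[i] = [max(0.0, t) for t in total]
    return out


identity = [[1.0, 0.0], [0.0, 1.0]]
h = {"p1": [1.0, 1.0], "a1": [1.0, 0.0], "t1": [0.0, 2.0]}
neighbors_by_rel = {"p1": {"written_by": ["a1"], "contain": ["t1"]}}
W_rel = {"written_by": identity, "contain": [[0.5, 0.0], [0.0, 0.5]]}
out = relational_layer(h, neighbors_by_rel, W_rel, W_self=identity)
print(out["p1"])
```

Because every relation carries its own matrix, the parameter count grows with the number of relation types, which is one reason such models often add some form of weight sharing.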

Compared with shallow models, HGNNs have the obvious advantage of supporting
inductive learning, i.e., learning embeddings for out-of-sample
nodes. Besides, HGNNs need less memory because they only need to store
model parameters. These two properties are important for real-world applications.
However, HGNNs still suffer from the high time cost of inference and retraining.


16.3.2 Encoder-decoder-based Methods 


Encoder-decoder-based techniques aim to employ some neural networks as encoder 
to learn embedding from node attributes and design a decoder to preserve some 
properties of the graphs (Tu et al, 2018; Chang et al, 2015; Zhang et al, 2019c; Chen 
and Sun, 2017; Zhang et al, 2018a; Park et al, 2019). 

For example, DHNE (Tu et al, 2018) proposes hyper-path-based random walk to
preserve both structural information and indecomposability of hyper-graphs.
Specifically, it designs a novel deep model to produce a non-linear tuple-wise
similarity function while capturing the local and global structures of a given HG.
Taking a hyperedge with three nodes $a$, $b$ and $c$ as an example (Fig. 16.6), the
first layer of DHNE is an autoencoder, which is used to learn latent embeddings and
preserve the second-order structures of the graph (Tang et al, 2015b). The second
layer is a fully connected layer with the embeddings concatenated:

$$L = \sigma\left(W_a h_a \oplus W_b h_b \oplus W_c h_c\right), \tag{16.14}$$

where $L$ denotes the embedding of the hyperedge; $h_a$, $h_b$ and
$h_c \in \mathbb{R}^{d \times 1}$ are the embeddings of nodes $a$, $b$ and $c$ learned
by the autoencoder. $W_a$, $W_b$ and $W_c \in \mathbb{R}^{d' \times d}$ are the
transformation matrices for different node types. Finally, the third layer is used
to calculate the indecomposability of the hyperedge:

$$\mathcal{S} = \sigma(W \cdot L + b), \tag{16.15}$$

where $\mathcal{S}$ denotes the indecomposability of the hyperedge;
$W \in \mathbb{R}^{1 \times 3d'}$ and $b \in \mathbb{R}^{1 \times 1}$ are the weight
matrix and bias, respectively. A higher value of $\mathcal{S}$ means the nodes are
from an existing hyperedge; otherwise the value should be small.

Fig. 16.6: The framework of DHNE (Tu et al, 2018). DHNE learns embeddings
for nodes in heterogeneous hypernetworks, which can simultaneously address
indecomposable hyperedges while preserving rich structural information.
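The second and third layers in Eqs. (16.14) and (16.15) amount to a type-specific projection, a concatenation and a sigmoid score. The sketch below assumes the node embeddings from the autoencoder are already given and uses the sigmoid for $\sigma$; the function name and the toy weights are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(W, x):
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def hyperedge_score(h_a, h_b, h_c, W_a, W_b, W_c, W, b):
    """Return the hyperedge embedding L (Eq. 16.14) and its score S (Eq. 16.15)."""
    # Project each node with its type-specific matrix, then concatenate (list +).
    concat = matvec(W_a, h_a) + matvec(W_b, h_b) + matvec(W_c, h_c)
    L = [sigmoid(x) for x in concat]
    S = sigmoid(sum(w * l for w, l in zip(W, L)) + b)
    return L, S


zeros = [[0.0, 0.0], [0.0, 0.0]]      # d' x d with d = d' = 2
h_a, h_b, h_c = [1.0, 2.0], [0.5, 0.5], [3.0, -1.0]
W = [1.0] * 6                         # 1 x 3d' weight row
L, S = hyperedge_score(h_a, h_b, h_c, zeros, zeros, zeros, W, b=-3.0)
print(L, S)
```

During training, tuples sampled from existing hyperedges would be pushed towards a score of 1 and corrupted tuples towards 0.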

Similarly, HNE (Chang et al, 2015) focuses on multi-modal heterogeneous graph. 
It uses CNN and autoencoder to learn embedding from images and texts, respec- 
tively. Then it uses the embedding to predict whether there is an edge between the 
images and texts. Camel (Zhang et al, 2018a) uses GRU as an encoder to learn paper 
embedding from the abstracts. A skip-gram objective function is used to preserve 
the local structures of the graphs. 


16.3.3 Adversarial-based Methods 


Adversarial-based techniques utilize the game between generator and discriminator
to learn robust node embeddings. In a homogeneous graph, adversarial-based
techniques only consider the structural information; for example, GraphGAN (Wang
et al, 2018a) uses Breadth First Search when generating virtual nodes. In a HG, the
discriminator and generator are designed to be relation-aware, which captures the
rich semantics on HGs. HeGAN (Hu et al, 2018a) is the first to use GAN in HG
embedding. It incorporates the multiple relations into the generator and
discriminator, so that the heterogeneity of a given graph can be considered.

As shown in Fig. 16.7 (c), HeGAN mainly consists of two competing players,
the discriminator and the generator. Given a node, the generator attempts to produce
fake samples associated with the given node to feed into the discriminator, whereas
the discriminator tries to improve its parameterization to separate the fake samples
from the real ones actually connected to the given node. The better trained
discriminator would then force the generator to produce better fake samples, and the
process is repeated. During such iterations, both the generator and discriminator
receive mutual, positive reinforcement. While this setup may appear similar to
previous efforts (Cai et al, 2018c; Dai et al, 2018c; Pan et al, 2018) on GAN-based
network embedding, HeGAN employs two major novelties to address the challenges of
adversarial learning on HGs.

Fig. 16.7: Overview of HeGAN (Hu et al, 2018a). (a) A toy HG for bibliographic
data. (b) Comparison between HeGAN and previous works. (c) The framework of
HeGAN for adversarial learning on HGs.

First, existing studies only leverage GAN to distinguish whether a node is real
or fake w.r.t. structural connections to a given node, without accounting for the
heterogeneity in HGs. For example, given a paper $p_2$, they treat nodes $a_2$, $a_4$
as real, whereas $a_1$, $a_3$ are fake simply based on the topology of the HG shown in
Fig. 16.7 (a). However, $a_2$ and $a_4$ are connected to $p_2$ for different reasons:
$a_2$ writes $p_2$ and $a_4$ only views $p_2$. Thus, they miss out on valuable
semantics carried by HGs, unable to differentiate $a_2$ and $a_4$ even though they play
distinct semantic roles. Given a paper $p_2$ as well as a relation, say,
write/written, HeGAN introduces a relation-aware discriminator to tell apart $a_2$ and
$a_4$. Formally, the relation-aware discriminator $D(e_v \mid u, r; \theta^D)$ evaluates
the connectivity between the pair of nodes $u$ and $v$ w.r.t. a relation $r$:

$$D(e_v \mid u, r; \theta^D) = \frac{1}{1 + \exp\left(-e_u^{D\top} M_r^D e_v\right)}, \tag{16.16}$$

where $e_v \in \mathbb{R}^{d \times 1}$ is the input embedding of the sample $v$,
$e_u^D \in \mathbb{R}^{d \times 1}$ is the learnable embedding of node $u$, and
$M_r^D \in \mathbb{R}^{d \times d}$ is a learnable relation matrix for relation $r$.

Second, existing studies are limited in sample generation in both effectiveness
and efficiency. They typically model the distribution of nodes using some form of
softmax over all nodes in the original graph. In terms of effectiveness, their fake
samples are constrained to the nodes in the graph, whereas the most representative
fake samples may fall “in between” the existing nodes in the embedding space. For
example, given a paper $p_2$, they can only choose fake samples from $\mathcal{V}$,
such as $a_1$ and $a_3$. However, both may not be adequately similar to real samples
such as $a_2$. Towards better sample generation, HeGAN introduces a generalized
generator that can produce latent nodes such as $a'$ shown in Fig. 16.7 (c), where it
is possible that $a' \notin \mathcal{V}$. In particular, the generalized generator
leverages the following Gaussian distribution:

$$\mathcal{N}\left(e_u^{G\top} M_r^G, \sigma^2 I\right), \tag{16.17}$$

where $e_u^G \in \mathbb{R}^{d \times 1}$ and $M_r^G \in \mathbb{R}^{d \times d}$ denote
the node embedding of $u \in \mathcal{V}$ and the relation matrix of
$r \in \mathcal{R}$ for the generator.
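The two components described above can be sketched as a Gaussian sampler and a bilinear sigmoid score. This is a minimal illustration of Eqs. (16.16) and (16.17) only: it omits any further transformation of the sampled vector and the adversarial training loop, and the parameter values and function names are hypothetical.

```python
import math
import random


def generate_fake(e_u_G, M_r_G, sigma, rng):
    """Draw a latent sample from N(e_u^T M_r, sigma^2 I), as in Eq. (16.17)."""
    d = len(e_u_G)
    mean = [sum(e_u_G[i] * M_r_G[i][k] for i in range(d)) for k in range(d)]
    return [rng.gauss(mu, sigma) for mu in mean]


def discriminate(e_v, e_u_D, M_r_D):
    """Probability that sample e_v is connected to u under relation r, Eq. (16.16)."""
    score = sum(e_u_D[i] * M_r_D[i][k] * e_v[k]
                for i in range(len(e_u_D)) for k in range(len(e_v)))
    return 1.0 / (1.0 + math.exp(-score))


identity = [[1.0, 0.0], [0.0, 1.0]]
rng = random.Random(0)

e_u_G = [1.0, 2.0]
fake = generate_fake(e_u_G, identity, sigma=0.1, rng=rng)   # may lie between real nodes
exact = generate_fake(e_u_G, identity, sigma=0.0, rng=rng)  # collapses to the mean

p_orthogonal = discriminate([0.0, 1.0], [1.0, 0.0], identity)
p_aligned = discriminate([1.0, 0.0], [1.0, 0.0], identity)
print(fake, p_orthogonal, p_aligned)
```

Because the generator samples from a continuous distribution, the fake sample need not coincide with any existing node embedding, which is the point of the generalized generator.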

Besides HeGAN, MV-ACM (Zhao et al, 2020c) uses GAN to generate
complementary views by computing the similarity of nodes in different views.
Overall, adversarial-based methods prefer to utilize negative samples to enhance the
robustness of embeddings. However, the choice of negative samples has a huge
influence on the performance, leading to higher variance.


16.4 Review 


Based on the above representative shallow and deep models, it can be seen that
shallow models mainly focus on the structure of HGs and rarely use additional
information such as attributes. One possible reason is that it is hard for shallow
models to capture the relationship between additional and structural information.
The learning ability of DNNs supports modeling this complex relationship. For
example, message passing-based techniques are good at encoding structures and
attributes simultaneously and at integrating different semantic information.
Compared with message passing-based techniques, encoder-decoder-based techniques
are weaker at fusing information because they lack a message-passing mechanism, but
they are more flexible in introducing different objective functions through
different decoders. Adversarial-based methods utilize negative samples to enhance
the robustness of embeddings, but the choice of negative samples has a large
influence on performance, which leads to higher variance (Hu et al, 2019a).

However, shallow and deep models each have their own pros and cons. Shallow
models lack non-linear representation capability, but are efficient and easy to
parallelize. Specifically, the complexity of the random walk technique consists of
two parts, random walk and skip-gram, both of which are linear in the number of
nodes. The decomposition technique needs to divide HGs into sub-graphs according to
the type of edges, so its complexity is linear in the number of edges, which is
higher than that of random walk. Deep models have stronger representation
capability, but they are more prone to fitting noise and have higher time and space
complexity. Additionally, the cumbersome hyperparameter tuning of deep models is
also criticized. Nevertheless, with the popularity of deep learning, deep models,
especially HGNNs, have become the main research direction in HG embedding.


16.5 Future Directions


HGNNs have made great progress in recent years, which clearly shows that they are a
powerful and promising graph analysis paradigm. In this section, we discuss
additional issues/challenges and explore a series of possible future research
directions.


16.5.1 Structures and Properties Preservation 


The basic success of HGNNs builds on preserving HG structure. This also motivates
many HGNNs to exploit different HG structures, of which the most typical one is the
meta-path (Dong et al, 2017; Shi et al, 2016). Following this line, the meta-graph
structure is naturally considered (Zhang et al, 2018b). However, an HG is far more
than these structures. Selecting the most appropriate meta-path is still very
challenging in the real world, and an improper meta-path will fundamentally hinder
the performance of HGNNs. Whether we can explore other techniques, e.g., motifs
(Zhao et al, 2019a; Huang et al, 2016b) or network schema (Zhao et al, 2020b), to
capture HG structure is worth pursuing. Moreover, if we rethink the goal of
traditional graph embedding, i.e., replacing structure information with the
distance/similarity in a metric space, a research direction to explore is whether we
can design HGNNs that naturally learn such distance/similarity rather than using
pre-defined meta-paths/meta-graphs.

As mentioned before, many current HGNNs mainly take structures into account.
However, some properties, which usually provide additional useful information for
modeling HGs, have not been fully considered. One typical property is the dynamics
of HGs, i.e., a real-world HG always evolves over time. Although incremental
learning on dynamic HGs has been proposed (Wang et al, 2020m), dynamic
heterogeneous graph embedding still faces big challenges. For example, the method
of Bian et al (2019) is built only on a shallow model, which greatly limits its
embedding ability. How to learn dynamic heterogeneous graph embeddings within the
HGNN framework is worth pursuing. The other property is the uncertainty of HGs,
i.e., the generation of an HG is usually multi-faceted and a node in an HG carries
different semantics. Traditionally, learning a single vector embedding cannot
capture such uncertainty well. A Gaussian distribution may innately represent the
uncertainty property (Kipf and Welling, 2016; Zhu et al, 2018), which is largely
ignored by current HGNNs. This suggests a direction with large potential for
improving HGNNs.


16.5.2 Deeper Exploration 


We have witnessed the great success and large impact of GNNs, where most of the
existing GNNs are proposed for homogeneous graphs (Kipf and Welling, 2017b;
Veličković et al, 2018). Recently, HGNNs have attracted considerable attention
(Wang et al, 2019m; Zhang et al, 2019b; Fu et al, 2020; Cen et al, 2019).

A natural question is what the essential difference between GNNs and HGNNs is.
Theoretical analysis of HGNNs is seriously lacking. For example, it is well
accepted that GNNs suffer from the over-smoothing problem (Li et al, 2018b), so
will HGNNs also have such a problem? If the answer is yes, what factor causes the
over-smoothing problem in HGNNs, given that they usually contain multiple
aggregation strategies (Wang et al, 2019m; Zhang et al, 2019b)?
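
For readers unfamiliar with over-smoothing, the toy sketch below shows the effect on a
homogeneous graph only: repeatedly applying a parameter-free mean aggregation makes
all node features converge to the same value. The path graph, the feature values, and
the helper names are assumptions made for illustration; the sketch says nothing about
whether or why HGNNs behave the same way, which is exactly the open question raised
above.

```python
def row_normalized_propagation(adj):
    """Return D^{-1}(A + I) for an adjacency matrix given as nested lists."""
    n = len(adj)
    with_loops = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
                  for i in range(n)]
    return [[v / sum(row) for v in row] for row in with_loops]


def propagate(P, x):
    """One mean-aggregation step without weights: x_i <- sum_j P[i][j] * x_j."""
    return [sum(p * xj for p, xj in zip(row, x)) for row in P]


# Path graph 0 - 1 - 2 - 3 with one scalar feature per node (made-up values).
adj = [[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]]
P = row_normalized_propagation(adj)

x = [1.0, 0.0, 0.0, -1.0]
spread_before = max(x) - min(x)
for _ in range(200):  # 200 stacked propagation steps
    x = propagate(P, x)
spread_after = max(x) - min(x)  # close to zero: nodes are no longer distinguishable
```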

In addition to theoretical analysis, new technique design is also important. One
of the most important directions is self-supervised learning. It uses pretext
tasks to train neural networks, thus reducing the dependence on manual labels
(Liu et al, 2020f). Since labels are often insufficient in practice,
self-supervised learning can greatly benefit unsupervised and semi-supervised
learning, and it has shown remarkable performance on homogeneous graph embedding
(Veličković et al, 2018; Sun et al, 2020c). Therefore, exploring self-supervised
learning on HGNNs is expected to further facilitate the development of this area.

Another important direction is the pre-training of HGNNs (Hu et al, 2020d; Qiu
et al, 2020a). Nowadays, HGNNs are designed independently, i.e., a proposed
method usually works well for certain tasks, but its ability to transfer across
different tasks is rarely considered. When dealing with a new HG or task, we have
to train HGNNs from scratch, which is time-consuming and requires a large number of
labels. In this situation, a well pre-trained HGNN with strong generalization that
can be fine-tuned with few labels would reduce both the time and the label
consumption.


16.5.3 Reliability 


Beyond the properties and techniques of HGs, we are also concerned about ethical
issues in HGNNs, such as fairness, robustness, and interpretability. Considering that
most methods are black boxes, making HGNNs reliable is important future work.

Fairness. The embeddings learned by these methods are sometimes highly related to
certain attributes, e.g., age or gender, which may amplify societal stereotypes in the
prediction results (Du et al, 2020). Therefore, learning fair or de-biased embeddings
is an important research direction. There has been some research on the fairness of
homogeneous graph embedding (Bose and Hamilton, 2019; Rahman et al, 2019).
However, the fairness of HGNNs is still an unsolved problem and remains an
important direction for future research.

Robustness. The robustness of HGNNs, especially against adversarial attacks, is
also an important problem (Madry et al, 2017). Since many real-world applications
are built on HGs, the robustness of HGNNs becomes an urgent yet unsolved problem.
What the weaknesses of HGNNs are and how to strengthen HGNNs against them need
further study.

Interpretability. Moreover, in some risk-aware scenarios, e.g., fraud detection
(Hu et al, 2019b) and bio-medicine (Cao et al, 2020), the explanation of models
or embeddings is important. A significant advantage of HGs is that they contain
rich semantics, which may provide valuable insight for explaining HGNNs. Besides,
the emerging disentangled learning (Siddharth et al, 2017; Ma et al, 2019c), which
divides the embedding into different latent spaces to improve interpretability,
can also be considered.


16.5.4 Applications 


Many HG-based applications have stepped into the era of graph embedding. HGNNs
have demonstrated strong performance in e-commerce and cybersecurity. Exploring
the capacity of HGNNs in other areas holds great potential for the future. For
example, in the software engineering area, there are complex relations among test
samples, requisition forms, and problem forms, which can be naturally modeled as
HGs. Therefore, HGNNs are expected to open up broad prospects for these new areas
and become a promising analytical tool. Another area is biological systems, which
can also be naturally modeled as HGs. A typical biological system contains many
types of objects, e.g., Gene Expression, Chemical, Phenotype, and Microbe. There
are also multiple relations between Gene Expression and Phenotype (Tsuyuzaki and
Nikaido, 2017). HG structure has been applied to biological systems as an
analytical tool, implying that HGNNs are expected to provide more promising
results.

In addition, since the complexity of HGNNs is relatively high and the techniques
are difficult to parallelize, it is difficult to apply existing HGNNs to
large-scale industrial scenarios. For example, the number of nodes in e-commerce
recommendation may reach one billion (Zhao et al, 2019b). Therefore, successfully
deploying these techniques in various applications while resolving the scalability
and efficiency challenges will be very promising.


Editor’s Notes: The concept of the heterogeneous graph essentially originated
in the data mining domain. Although heterogeneous graphs can usually be
formulated as attributed graphs (Chapter 4), the research focus of the former
is typically the frequent combinatorial patterns of node types in a subgraph
(usually a path). Heterogeneous graphs represent a wide range of real-world
applications, which usually involve multiple heterogeneous data sources. For
example, in recommender systems introduced in Chapter 19, we have both “user”
nodes and “item” nodes as well as higher-order patterns formed by multiple node
types. Similarly, molecules and proteins as well as many networks in Natural
Language Processing and Program Analysis can also be considered as heterogeneous
graphs (see Chapters 21, 22, 24, 25).


Abstract Graph neural networks (GNNs) are efficient deep learning tools for analyzing
networked data. As GNNs are widely applied in graph analysis tasks, their rapid
evolution has led to a growing number of novel architectures. In practice, both neural
architecture construction and training hyperparameter tuning are crucial to node
representation learning and the final model performance. However, as graph data
characteristics vary significantly across real-world systems, rich human expertise and
laborious trials are required to identify a suitable GNN architecture and training
hyperparameters for a specific scenario. Recently, automated machine learning
(AutoML) has shown its potential in finding optimal solutions automatically for
machine learning applications. While relieving the burden of the manual tuning
process, AutoML could provide access to optimal solutions without extensive expert
experience. Motivated by the previous successes of AutoML, some preliminary
automated GNN (AutoGNN) frameworks have been developed to tackle the problems of
GNN neural architecture search (GNN-NAS) and training hyperparameter tuning. This
chapter presents a comprehensive and up-to-date review of AutoGNN from two
perspectives, namely search space and search algorithm. Specifically, we mainly
focus on the GNN-NAS problem and present the state-of-the-art techniques from these
two perspectives. We further discuss the open problems related to the existing
methods for future research.


17.1 Background 


Graph neural networks (GNNs) have made substantial progress in integrating deep
learning approaches to analyze graph-structured data collected from various
domains, such as social networks (Ying et al, 2018b; Huang et al, 2019d; Monti et al,
2017; He et al, 2020), academic networks (Yang et al, 2016b; Kipf and Welling,
2017b; Gao et al, 2018a), and biochemical modular graphs (Zitnik and Leskovec,
2017; Aynaz Taheri, 2018; Gilmer et al, 2017; Jiang and Balaprakash, 2020).
Following the common message passing strategy, GNNs apply spatial graph
convolutional layers to learn a node’s embedding representation by aggregating the
representations of its neighbors and combining them with that of the node itself. A
GNN architecture is then constructed by stacking multiple such layers together with
their inter-layer skip connections, where the elementary operations of a layer
(e.g., aggregation and combination functions) and the concrete inter-layer
connections are specified in each design. To adapt to different real-world
applications, a variety of GNN architectures have been explored, including GCN
(Kipf and Welling, 2017b), GraphSAGE (Hamilton et al, 2017b), GAT (Veličković
et al, 2018), SGC (Wu et al, 2019a), JKNet (Xu et al, 2018a), and GCNII (Chen et al,
2020l). They vary in how they aggregate the neighborhood information (e.g., mean
aggregation in GCN versus neighbor attention learning in GAT) and in the choice of
skip connections (e.g., no skip connection in GCN versus initial connection in
GCNII).

Despite the significant success of GNNs, their empirical implementations are
usually accompanied by careful architecture engineering and training
hyperparameter tuning, aiming to adapt to the different types of graph-structured
data. Based on the researcher’s prior knowledge and trial-and-error tuning, a
GNN architecture is instantiated from its model space and evaluated in each graph
analysis task. For example, for the underlying model GraphSAGE (Hamilton et al,
2017b), architectures of various sizes, determined by different numbers of hidden
units, are applied to citation networks and protein-protein interaction graphs,
respectively. Furthermore, the optimal skip connection mechanisms in JKNet
architectures (Xu et al, 2018a) vary with the real-world tasks. Besides
architecture engineering, the training hyperparameters, including the learning
rate, weight decay, and number of epochs, play important roles in the final model
performance. In open repositories, these hyperparameters are manually tuned to
obtain the desired model performance. The tedious selection of GNN architectures
and training hyperparameters not only burdens data scientists, but also makes it
difficult for beginners to quickly access high-performance solutions for their
tasks at hand.

Automated machine learning (AutoML) has emerged as a prevailing research area that
aims to liberate the community from time-consuming manual tuning processes (Chen
et al, 2021). Given any task and based on a predefined search space, AutoML
aims at automatically optimizing machine learning solutions (also referred to as
designs), and includes neural architecture search (NAS) and automated
hyperparameter tuning (AutoHPT) as sub-fields. While NAS targets the optimization
of architecture-related parameters (e.g., the layer number and hidden units),
AutoHPT selects training-related parameters (e.g., the learning rate and weight
decay). It has been widely reported that the novel neural architectures discovered
by NAS outperform human-designed ones in many machine learning applications,
including image classification (Zoph and Le, 2016; Zoph et al, 2018; Liu et al,
2017b; Pham et al, 2018; Jin et al, 2019a; Luo et al, 2018; Liu et al, 2018b,c; Xie
et al, 2019a; Kandasamy et al, 2018), semantic image segmentation (Chenxi Liu,
2019), and image generation (Wang and Huan, 2019; Gong et al, 2019). Since as early
as the 1990s (Kohavi and John, 1995), it has been commonly acknowledged that
AutoHPT could improve over the default training setting (Feurer and Hutter, 2019;
Chen et al, 2021). Motivated by the previous successful applications of AutoML,
there have been some recent efforts to combine research on AutoML and GNNs (Gao
et al, 2020b; Zhou et al, 2019a; You et al, 2020a; Ding et al, 2020a; Zhao et al,
2020a,g; Nunes and Pappa, 2020; Li and King, 2020; Shi et al, 2020; Jiang and
Balaprakash, 2020). They generally define automated GNN (AutoGNN) as an
optimization problem and formulate their own working pipelines from three
perspectives, as shown in Figure 17.1: the search space, the search algorithm, and
the performance estimation strategy. The search space consists of a large volume of
candidate designs, including GNN architectures and training hyperparameters. On top
of the search space, several heuristic search algorithms are proposed to solve the
NP-complete optimization problem by iteratively approximating the well-performing
designs, including random search (You et al, 2020a). The objective of performance
estimation is to accurately estimate the task performance of every candidate design
explored at each step. Once the search process terminates, the best neural
architecture, together with suitable training hyperparameters, is returned to be
evaluated on the downstream machine learning task.

In this chapter, we organize the existing efforts and illustrate the AutoGNN
framework in the following sections: notations, problem definition, and
challenges of AutoGNN (in Sections 17.1.1, 17.1.2, and 17.1.3), search space (in
Section 17.2), and search algorithm (in Section 17.3). We then present the open
problems for future research in Section 17.4. In particular, since the community’s
interest mainly focuses on discovering powerful GNN architectures, we pay more
attention to GNN-NAS in this chapter.


17.1.1 Notations of AutoGNN 


Fig. 17.1: Illustration of a general framework for AutoGNN. The search space
consists of plenty of designs, including GNN architectures and training
hyperparameters. At each step, the search algorithm samples a candidate design from
the search space and estimates its model performance on the downstream task. Once
the search process terminates, the design with the highest performance on the
validation set is returned and exploited for the real-world system.


Following the previous expressions (You et al, 2020a), we use the term “design”
to refer to an available solution of the optimization problem in AutoGNN. A
design consists of a concrete GNN architecture and a specific set of training
hyperparameters. Specifically, a design is characterized by multiple dimensions,
including architecture dimensions (e.g., the layer number, skip connections, and
aggregation and combination functions) and hyperparameter dimensions (e.g., the
learning rate and weight decay). Along each design dimension, a series of
elementary options is provided to support automated architecture engineering or
training hyperparameter tuning. For example, we could have candidates
{SUM, MEAN, MAX} at the aggregation function dimension, and use
{1e-4, 5e-4, 1e-3, 5e-3, 0.01, 0.1} at the learning rate dimension. Given the
series of candidate options along each dimension, the search space in AutoGNN is
constructed as the Cartesian product of all the design dimensions. A design is
instantiated by assigning concrete values to these dimensions, such as a GNN
architecture with the aggregation function MEAN and a learning rate of 1e-3. Note
that GNN-NAS and AutoHPT explore search spaces consisting of GNN architectures and
hyperparameter combinations, respectively; AutoGNN optimizes in a more
comprehensive search space containing both of them.
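
The construction of a search space as a Cartesian product can be sketched in a few
lines. The two option lists are the ones quoted above; the dictionary keys
`aggregation` and `learning_rate` are hypothetical labels chosen for this sketch, and
a realistic search space would contain many more dimensions.

```python
import itertools
import random

# Candidate options per design dimension (dimension names are illustrative).
search_space = {
    "aggregation": ["SUM", "MEAN", "MAX"],
    "learning_rate": [1e-4, 5e-4, 1e-3, 5e-3, 0.01, 0.1],
}

dimensions = list(search_space)
# The search space is the Cartesian product of the option lists.
designs = [dict(zip(dimensions, combo))
           for combo in itertools.product(*search_space.values())]

# Instantiating a design means fixing one option on every dimension.
rng = random.Random(0)
sampled_design = rng.choice(designs)
```

With 3 aggregation options and 6 learning rates this toy space holds 18 designs. The
size grows multiplicatively with every added dimension, which is why enumerating a
full AutoGNN search space quickly becomes impractical.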


17.1.2 Problem Definition of AutoGNN


Before diving into detailed techniques, we examine the essence of AutoGNN by
formally defining its optimization problem. To be specific, let $\mathcal{F}$ be the
search space. Let $\mathcal{D}_{train}$ and $\mathcal{D}_{valid}$ be the training and
validation sets, respectively. Let M be the performance evaluation metric of a design
in any given graph analysis task, e.g., F1 score or accuracy in the node
classification task. The objective of AutoGNN is to find the optimal design
$f^* \in \mathcal{F}$ in terms of M evaluated on the validation set
$\mathcal{D}_{valid}$. Formally, AutoGNN requires solving the following bi-level
optimization problem:

$$\begin{aligned}
f^* &= \operatorname{argmax}_{f \in \mathcal{F}} \; M(f(\theta^*); \mathcal{D}_{valid}), \\
\text{s.t.} \quad \theta^* &= \operatorname{argmin}_{\theta} \; L(f(\theta); \mathcal{D}_{train}),
\end{aligned} \qquad (17.1)$$

where $\theta^*$ denotes the optimized trainable weights of design f and L denotes the
loss function. For each design, AutoGNN first optimizes its associated weights
$\theta$ by minimizing the loss on the training set through gradient descent, and then
evaluates the design on the validation set to decide whether it is the optimal one. By
solving the above optimization problem, AutoGNN automates the architecture
engineering and training hyperparameter tuning procedure, and allows a broad scope
of candidate GNN designs to be examined. However, it is well known that such a
bi-level optimization problem is NP-complete (Chen et al, 2021), so it would be
extremely time-consuming to search and evaluate well-performing designs on large
graphs with massive numbers of nodes and edges. Fortunately, some heuristic search
techniques have been proposed to locate a locally optimal design (e.g., a CNN or RNN
architecture) as close as possible to the global one in the applications of image
classification and natural language processing, including reinforcement learning
(RL) (Zoph and Le, 2016; Zoph et al, 2018; Pham et al, 2018; Cai et al, 2018a; Baker
et al, 2016), evolutionary methods (Liu et al, 2017b; Real et al, 2017; Miikkulainen
et al, 2019; Xie and Yuille, 2017; Real et al, 2019), and Bayesian optimization (Jin
et al, 2019a). They iteratively explore the next design and update the search
algorithm based on the performance feedback of the new design, in order to move
toward the globally optimal solution. Compared with the previous efforts, the
characteristics of the AutoGNN problem can be viewed from two aspects: the search
space and the search algorithms tailored to identify the optimal GNN design. In the
following sections, we detail the challenges and the existing AutoGNN work.
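
The two levels of Eq. (17.1) can be illustrated on a deliberately tiny stand-in
problem. In the sketch below, a "design" is not a GNN: it is a one-weight regression
model with a hypothetical architecture-like dimension `power` and a hyperparameter
dimension `learning_rate`, and the data are generated from an assumed toy function.
The inner loop trains the weight on the training set, and the outer loop picks the
design with the best validation metric. The space is small enough to enumerate; the
heuristic search algorithms mentioned above replace this enumeration when the space
is large.

```python
import itertools


def train(design, data, steps=200):
    """Inner problem: fit w in y ~ w * x**power by gradient descent on the MSE."""
    w = 0.0
    for _ in range(steps):
        grad = sum(2 * (w * x ** design["power"] - y) * x ** design["power"]
                   for x, y in data) / len(data)
        w -= design["learning_rate"] * grad
    return w


def metric(design, w, data):
    """Outer objective M: negative validation MSE, so larger is better."""
    return -sum((w * x ** design["power"] - y) ** 2 for x, y in data) / len(data)


# Toy data from y = 2 * x^2 (an assumption made purely for illustration).
train_set = [(x, 2 * x * x) for x in (-1 + 2 * i / 19 for i in range(20))]
valid_set = [(x, 2 * x * x) for x in (-0.95 + 0.1 * i for i in range(20))]

space = {"power": [1, 2, 3], "learning_rate": [0.01, 0.1]}
best_design, best_score = None, float("-inf")
for combo in itertools.product(*space.values()):
    design = dict(zip(space, combo))
    w_star = train(design, train_set)            # theta* = argmin L(f(theta); D_train)
    score = metric(design, w_star, valid_set)    # M(f(theta*); D_valid)
    if score > best_score:
        best_design, best_score = design, score
```

Every candidate design requires a full training run before it can be scored, which is
the source of the search cost discussed in this section.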


17.1.3 Challenges in AutoGNN 


The direct application of existing AutoML frameworks to automate GNN designs is
non-trivial, due to the following two major challenges.

First, the search space of AutoGNN is significantly different from the ones in the
AutoML literature. Taking NAS applied to discovering CNN architectures (Zoph
and Le, 2016) as an example, the search space of the convolution operation is mainly
specified by the convolutional kernel size. In contrast, for message-passing based
graph convolution, the search space of spatial graph convolution is constructed from
multiple key architecture dimensions, including aggregation, combination, and
embedding activation functions. With the growing number of GNN model variants, it is
important to formulate a good search space that is both expressive and compact. On
the one hand, the search space should cover the important architecture dimensions to
subsume the existing human-designed architectures and adapt to a series of diverse
graph analysis tasks. On the other hand, the search space should be compact,
excluding non-general dimensions and incorporating modest ranges of options along
each dimension, in order to save search time.

Second, the search algorithm should be tailored to discover well-performing
designs efficiently based on the special search space in AutoGNN. The search
controller determines how to iteratively explore the search space and update the
search algorithm according to the performance feedback of sampled designs. A good
controller needs to balance the trade-off between exploration and exploitation
during the search process, in order to avoid being trapped prematurely in a
sub-optimal region and to quickly discover well-performing designs, respectively.
However, the previous search algorithms may be inefficient when applied to GNN-NAS.
Specifically, one of the key properties of GNN architectures is that the model
performance may vary significantly with a slight modification along an architecture
dimension. For example, it has been theoretically and empirically demonstrated that
the graph classification accuracy could be improved by simply replacing max pooling
with summation in the aggregation function dimension of a GNN (Xu et al, 2019d). The
previous RL-based methods sample and evaluate the whole architecture at each search
step. It would be hard for such search algorithms to learn the relationship needed
to explore better GNNs, namely which architecture dimension modifications improve or
degrade the model performance. Another challenging problem is the surge of new graph
analysis tasks, which requires huge computation resources to optimize GNN
architectures. Instead of searching for the optimal GNN from scratch, it is crucial
to transfer the well-performing architectures discovered before to the new task to
save the expensive computation cost.


17.2 Search Space 


In this section, we summarize the search spaces in the literature. As shown in
Figure 17.2, the search spaces of designs in AutoGNN are differentiated according to
GNN architectures and training hyperparameters, whose details are listed below.


17.2.1 Architecture Search Space


Considering the existing AutoGNN frameworks (Gao et al, 2020b; Zhou et al,
2019a), the GNN model is commonly implemented based on the spatial graph
convolution mechanism. To be specific, the spatial graph convolution takes the input
graph as a computation graph and learns node embeddings by passing messages along
edges. A node embedding is updated recursively by aggregating the embedding
representations of its neighbors and combining them with that of the node itself.
Formally, the k-th spatial graph convolutional layer of a GNN could be expressed as:

$$\begin{aligned}
\mathbf{h}_i^{(k)} &= \mathrm{AGGREGATE}\left(\left\{a_{ij}^{(k)} \mathbf{W}^{(k)} \mathbf{x}_j^{(k-1)} : j \in \mathcal{N}(i)\right\}\right), \\
\mathbf{x}_i^{(k)} &= \mathrm{ACT}\left(\mathrm{COMBINE}\left(\mathbf{W}^{(k)} \mathbf{x}_i^{(k-1)}, \mathbf{h}_i^{(k)}\right)\right).
\end{aligned} \qquad (17.2)$$

$\mathbf{x}_i^{(k)}$ denotes the embedding vector of node $v_i$ at the k-th layer.
$\mathcal{N}(i)$ denotes the set of neighbors adjacent to node $v_i$.
$\mathbf{W}^{(k)}$ denotes the trainable weight matrix used to project node
embeddings. $a_{ij}^{(k)}$ denotes the message-passing weight along the edge
connecting nodes $v_i$ and $v_j$, which is determined by the normalized graph
adjacency matrix or learned from an attention mechanism. Function AGGREGATE, such as
mean, max, and sum pooling, is used to aggregate neighbor representations. Function
COMBINE is used to combine the neighbor embedding $\mathbf{h}_i^{(k)}$ with the node
embedding $\mathbf{x}_i^{(k-1)}$ from the last layer. Finally, function ACT (e.g.,
ReLU) is used to add non-linearity to the embedding learning.
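
A single layer of Eq. (17.2) can be written out directly for a small graph. The sketch
below is a plain-Python illustration rather than code from any AutoGNN framework: it
fixes COMBINE to element-wise sum and ACT to ReLU as assumed choices, stores the
neighbor sets in a dictionary, and assumes every node has at least one neighbor. The
names `graph_conv_layer`, `AGGREGATORS`, and `matvec` are hypothetical.

```python
def matvec(W, x):
    """Matrix-vector product W x for nested-list W and list x."""
    return [sum(w * xj for w, xj in zip(row, x)) for row in W]


# Candidate AGGREGATE functions, applied coordinate-wise over neighbor messages.
AGGREGATORS = {
    "SUM": lambda msgs: [sum(col) for col in zip(*msgs)],
    "MEAN": lambda msgs: [sum(col) / len(msgs) for col in zip(*msgs)],
    "MAX": lambda msgs: [max(col) for col in zip(*msgs)],
}


def relu(v):
    return [max(0.0, t) for t in v]


def graph_conv_layer(x, neighbors, a, W, aggregate="SUM"):
    """One layer of Eq. (17.2) with COMBINE = sum and ACT = ReLU (assumed choices)."""
    out = []
    for i, x_i in enumerate(x):
        # Messages a_ij * W x_j from every neighbor j of node i (needs >= 1 neighbor).
        msgs = [[a[i][j] * t for t in matvec(W, x[j])] for j in neighbors[i]]
        h_i = AGGREGATORS[aggregate](msgs)
        combined = [s + h for s, h in zip(matvec(W, x_i), h_i)]
        out.append(relu(combined))
    return out


# Path graph 0 - 1 - 2 with 2-d features, identity weights, and unit edge weights.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
neighbors = {0: [1], 1: [0, 2], 2: [1]}
a = [[1.0] * 3 for _ in range(3)]
W = [[1.0, 0.0], [0.0, 1.0]]
x_next = graph_conv_layer(x, neighbors, a, W, aggregate="SUM")
```

Switching the `aggregate` argument changes only one dimension of the layer while
everything else stays fixed, which mirrors how a search space varies one architecture
dimension at a time.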

As shown in Figure 17.2, a GNN architecture consists of several graph
convolutional layers defined in Eq. (17.2), and may incorporate skip connections
between any two layers, similar to residual CNNs (He et al, 2016a). Following
the previous definitions in NAS, we use the term “micro-architecture” to represent
a graph convolutional layer, including the specifications of hidden units and graph
convolutional functions; we use the term “macro-architecture” to represent the
network topology, including the choices of layer depth, inter-layer skip
connections, and pre/post-processing layers. The architecture search space contains
a large volume of diverse GNN architectures, and can be categorized into the search
spaces of micro-architectures and macro-architectures.


17.2.1.1 Micro-architecture Search Space 


Fig. 17.2: Illustration of a comprehensive search space, which consists of
micro-architecture, macro-architecture, and training hyperparameter search spaces.
Each space is characterized by multiple dimensions, such as hidden units and
propagation function in the micro-architecture search space. Each dimension provides
a series of candidate options, and the search space is constructed as the Cartesian
product of all its dimensions. A discrete point in the comprehensive search space
represents a specific design, which adopts one option at each dimension.


According to Eq. (17.2) and as shown in Figure 17.2, the micro-architecture
of a graph convolutional layer is characterized by the following five architecture
dimensions:

• Hidden units: Trainable matrix
  $\mathbf{W}^{(k)} \in \mathbb{R}^{d^{(k)} \times d^{(k-1)}}$ maps node embeddings to
  a new space and learns to extract informative features. $d^{(k)}$ is the number
  of hidden units and plays a key role in task performance. In the GNN-NAS
  frameworks of GraphNAS (Gao et al, 2020b) and AGNN (Zhou et al, 2019a),
  $d^{(k)}$ is usually selected from the set {4, 8, 16, 32, 64, 128, 256}.
• Propagation function: It determines the message-passing weight $a_{ij}^{(k)}$ to
  specify how node embeddings are propagated upon the input graph structure. In
  a wide variety of GNN models (Kipf and Welling, 2017b; Wu et al, 2019a;
  Hamilton et al, 2017b; Ding et al, 2020a), $a_{ij}^{(k)}$ is defined by the
  corresponding element of the normalized adjacency matrix $D^{-1/2} A D^{-1/2}$ or
  $D^{-1} A$, where A is the graph adjacency matrix with self-loops and D is its
  degree matrix. Note that real-world graph-structured data could be both complex
  and noisy (Lee et al, 2019c), which leads to inefficient neighbor aggregation.
  GAT (Veličković et al, 2018) applies an attention mechanism to compute
  $a_{ij}^{(k)}$ so as to attend to relevant neighbors. Based on the existing
  GNN-NAS frameworks (Gao et al, 2020b; Zhou et al, 2019a; Ding et al, 2020a), we
  list the common choices of propagation functions in Table 17.1.
• Aggregation function: Depending on the input graph structure, a proper
  application of the aggregation function is important for learning the
  informative neighbor distribution (Xu et al, 2019d). For example, a mean pooling
  function takes the average of neighbors, while max pooling preserves only the
  most significant one. The aggregation function is usually selected from the set
  {SUM, MEAN, MAX}.
• Combination function: It is used to combine the neighbor embedding
  $\mathbf{h}_i^{(k)}$ and the projected embedding
  $\mathbf{W}^{(k)} \mathbf{x}_i^{(k-1)}$ of the node itself. Examples of
  combination functions include sum and multi-layer perceptron (MLP). While the sum
  operation simply adds the two embeddings, MLP further applies a linear mapping
  based upon the summation or concatenation of these two embeddings.
• Activation function: The candidate activation function is usually selected from
  {Sigmoid, Tanh, ReLU, Linear, Softplus, LeakyReLU, ReLU6, ELU}.
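
The two normalized-adjacency choices named under the propagation function dimension
can be computed as follows. This is a small plain-Python sketch on an assumed 3-node
path graph; the function name `propagation_weights` and its `mode` argument are
hypothetical, and attention-based propagation is not covered.

```python
import math


def propagation_weights(adj, mode="sym"):
    """Weights a_ij from D^{-1/2} A D^{-1/2} ("sym") or D^{-1} A ("row").

    A is the adjacency matrix with self-loops added and D is its degree matrix.
    """
    n = len(adj)
    A = [[1.0 if (i == j or adj[i][j]) else 0.0 for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in A]
    if mode == "sym":
        return [[A[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
                for i in range(n)]
    return [[A[i][j] / deg[i] for j in range(n)] for i in range(n)]


# Path graph 0 - 1 - 2; degrees including self-loops are 2, 3, 2.
adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
a_sym = propagation_weights(adj, mode="sym")
a_row = propagation_weights(adj, mode="row")
```

In both variants the weight between unconnected nodes is zero. The row-normalized
variant makes each node's incoming weights sum to one, whereas the symmetric variant
scales each edge by the degrees of both endpoints.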


Given the above five architecture dimensions and their associated candidate
options, the micro-architecture search space is constructed as their Cartesian
product. Each discrete point in the micro-architecture search space corresponds
to a concrete micro-architecture, e.g., a graph convolutional layer with
{Hidden units: 64, Propagation function: GAT, Aggregation function: SUM,
Combination function: MLP, Activation function: ReLU}. By providing extensive
candidate options along each dimension, the micro-architecture search space
covers most layer implementations in the state-of-the-art models, such as
Chebyshev (Defferrard et al, 2016), GCN (Kipf and Welling, 2017b), GAT
(Veličković et al, 2018), and LGCN (Gao et al, 2018a).
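
To make the Cartesian-product construction concrete, the following sketch enumerates a small micro-architecture search space. The aggregation, combination, and activation options are taken from the text; the hidden-unit and propagation candidates are illustrative placeholders, and the names `MICRO_SPACE`, `enumerate_space`, and `space_size` are hypothetical.

```python
from itertools import product
from math import prod
from typing import Any, Dict, Iterator, List

MICRO_SPACE: Dict[str, List[Any]] = {
    "hidden_units": [16, 32, 64],             # placeholder candidates
    "propagation": ["NORMALIZED_ADJ", "GAT"],  # placeholder subset of Table 17.1
    "aggregation": ["SUM", "MEAN", "MAX"],
    "combination": ["SUM", "MLP"],
    "activation": ["Sigmoid", "Tanh", "ReLU", "Linear",
                   "Softplus", "LeakyReLU", "ReLU6", "ELU"],
}


def enumerate_space(space: Dict[str, List[Any]]) -> Iterator[Dict[str, Any]]:
    """Yield every concrete design, i.e., one option per dimension."""
    names = list(space)
    for values in product(*(space[name] for name in names)):
        yield dict(zip(names, values))


def space_size(space: Dict[str, List[Any]]) -> int:
    """Number of discrete points: the product of the option counts."""
    return prod(len(options) for options in space.values())
```

Even this reduced space contains 3 x 2 x 3 x 2 x 8 = 288 layer designs, which illustrates how quickly the number of candidates grows with each added dimension.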


Table 17.1: Propagation function candidates to compute weight $a_{ij}^{(k)}$ if
nodes $v_i$ and $v_j$ are connected; otherwise $a_{ij}^{(k)} = 0$. Symbol ||
denotes the concatenation operation, $a$, $a_l$ and $a_r$ denote trainable
vectors, and $W_G^{(k)}$ is a trainable matrix.


17.2.1.2 Macro-architecture Search Space


Besides the micro-architecture, another architectural level of a GNN is its
macro-architecture, i.e., the network topology, as shown in Figure 17.2. The
macro-architecture of a GNN specifies the numbers of graph convolutional layers
and pre/post-processing layers, as well as the choices of skip connections
(You et al, 2020a; Li et al, 2018b, 2019c). We list the details of these four
architecture dimensions in the following.

• Graph convolutional layer depth: The direct stacking of multiple layers is
commonly adopted to enlarge the receptive fields of nodes. Let $l_{gc}$ denote
the number of graph convolutional layers; $l_{gc}$ is usually selected from the
range [2, 10].

• Pre-processing layer depth: In real-world applications, the node input
features may be very high-dimensional, which leads to costly computation in
hidden feature learning. Feature pre-processing was first included in the
search space by You et al (2020a) and is conducted by an MLP, whose number of
layers is denoted as $l_{pre}$. $l_{pre}$ is sampled from the candidates
{0, 1, 2, 3}.

• Post-processing layer depth: Similarly, post-processing MLP layers are
applied to project hidden embeddings into a task-specific space, e.g., an
embedding space whose dimension equals the number of class labels in the node
classification task. Let $l_{post}$ denote the number of layers, with example
candidates {0, 1, 2, 3}.

• Skip connections: Following the residual deep CNNs in computer vision and
the recent deep GNNs, skip connections have been incorporated in the search
space of GNN-NAS frameworks (You et al, 2020a; Zhao et al, 2020g,a).
Specifically, at layer $k$, the embeddings of up to $k-1$ previous layers can
be sampled and combined into the current layer’s output, leading to $2^{k-1}$
possible decisions at layer $k$. To combine the prior node embeddings that are
connected to the current output, a series of candidate options have been
developed, namely {SUM, CAT, MAX, LSTM}. Options SUM, CAT, and MAX add,
concatenate, and take the element-wise maximum of these connected embeddings,
respectively. LSTM uses an attention mechanism to compute the importance score
of each layer and then obtains the weighted average of the connected
embeddings (Xu et al, 2018a).
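
The skip-connection dimension can be illustrated with a short sketch that enumerates the $2^{k-1}$ connection patterns at layer $k$ and applies the SUM, CAT, and MAX options. The function names are hypothetical, embeddings are plain Python lists, and the LSTM option is omitted because it needs trainable attention parameters.

```python
from itertools import combinations
from typing import List, Sequence, Tuple


def skip_choices(layer: int) -> List[Tuple[int, ...]]:
    """All subsets of the previous layers 1..layer-1 that may feed `layer`."""
    previous = range(1, layer)
    return [subset
            for size in range(layer)
            for subset in combinations(previous, size)]


def combine_skips(embeddings: Sequence[Sequence[float]], how: str) -> List[float]:
    """Combine the connected layer embeddings of one node."""
    if how == "SUM":
        return [sum(col) for col in zip(*embeddings)]
    if how == "CAT":
        return [x for emb in embeddings for x in emb]
    if how == "MAX":
        return [max(col) for col in zip(*embeddings)]
    raise ValueError(f"unsupported combination option: {how}")
```

For example, `len(skip_choices(4))` is 8, matching $2^{4-1}$, so the number of patterns doubles with every additional layer.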


The entire architecture space is constructed as the Cartesian product of the
micro- and macro-architecture search spaces, and is fully characterized by the
nine architecture dimensions. It can be extremely large, and is comprehensive
enough to subsume the recent residual GNN models, such as JKNet (Xu et al,
2018a) and deeperGCN (Li et al, 2018b).


17.2.2 Training Hyperparameter Search Space 


The training hyperparameters have significant impacts on the task performances of 
GNN architectures, and have been explored in AutoGNN frameworks (You et al, 
2020a; Shi et al, 2020). We summarize four important dimensions of training hy- 
perparameters in the following and show them in Figure 17.2. 


• Dropout rate: Applied at the beginning of each graph convolutional layer or
pre/post-processing layer, a proper dropout rate is crucial to avoid
over-fitting. Widely used examples are
{False, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6}.

• Batch normalization: It is applied after a graph convolutional layer or
pre/post-processing layer to normalize the node embeddings of the whole graph
or of a batch (Zhou et al, 2020d; Zhao and Akoglu, 2019; Ioffe and Szegedy,
2015). The candidate normalization techniques include {False, BatchNorm
(Ioffe and Szegedy, 2015), PairNorm (Zhao and Akoglu, 2019), DGN (Zhou et al,
2020d), NodeNorm (Zhou et al, 2020c), GraphNorm (Cai et al, 2020d)}.

• Learning rate: While a larger learning rate leads to a premature suboptimal
solution, a smaller one makes the optimization process converge slowly. The
candidate learning rates are {1e-4, 5e-4, 1e-3, 5e-3, 0.01, 0.1}.

• Training epoch: According to common practice (You et al, 2020a; Kipf and
Welling, 2017b), example numbers of training epochs are
{100, 200, 400, 500, 1000}.


17.2.3 Efficient Search Space 


Given the micro-architecture, macro-architecture, and training hyperparameter
search spaces, the search space applied in a practical system is formulated as
the Cartesian product of any combination of them. Although a large search
space subsumes diverse GNN architectures and training environments and can
thus adapt to different graph analysis tasks, exploring it for the optimal
design is time-consuming. To make the search process efficient, two mainstream
simplified search spaces are applied in the existing AutoGNN frameworks.


• Focus on GNN-NAS: Instead of fully tuning the training hyperparameters,
most AutoGNN (or GNN-NAS) frameworks (Gao et al, 2020b; Zhou et al, 2019a;
Zhao et al, 2020a,g; Ding et al, 2020a; Nunes and Pappa, 2020; Li and King,
2020; Jiang and Balaprakash, 2020) focus on the problem of discovering
well-performing GNN architectures. Compared with AutoHPT, it is commonly
acknowledged that a novel architecture discovered by GNN-NAS is more important
and challenging to the research community, and could motivate data scientists
to improve GNN model paradigms in the future. In GNN-NAS, the search space is
thus reduced to one containing only the neural architecture variants.

• Simplify architecture search space: Even in GNN-NAS, the many architecture
dimensions and their associated candidate options still make the search space
complex. Based on prior knowledge about the impacts of different modules on
model performance, one would prefer to explore only the crucial architecture
dimensions in practical systems. For example, it is found that the simplified
search space (Zhao et al, 2020a) characterized by the aggregation function and
skip connections can generate high-performance GNN architectures comparable
to those from the comprehensive search spaces (Gao et al, 2020b; Zhou et al,
2019a). In particular, since the decision cardinality of skip connections
increases exponentially with the number of layers, the simplified search space
explores the skip connections only in the last layer, similar to JKNet (Xu et
al, 2018a). In another simplified search space, the model-specific
architecture dimensions, including the hidden units, propagation function, and
combination function, are excluded from the search and pre-defined based on
expert experience.
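
The effect of pre-defining some dimensions can be sketched as pinning them to a single option, which shrinks the Cartesian product accordingly. The helper names `simplify` and `size`, the example space, and the pinned values below are illustrative assumptions rather than settings from any cited framework.

```python
from math import prod
from typing import Any, Dict, List

Space = Dict[str, List[Any]]

EXAMPLE_SPACE: Space = {
    "hidden_units": [16, 32, 64],             # placeholder candidates
    "propagation": ["NORMALIZED_ADJ", "GAT"],  # placeholder candidates
    "aggregation": ["SUM", "MEAN", "MAX"],
    "combination": ["SUM", "MLP"],
    "activation": ["Sigmoid", "Tanh", "ReLU", "Linear",
                   "Softplus", "LeakyReLU", "ReLU6", "ELU"],
}


def simplify(space: Space, fixed: Dict[str, Any]) -> Space:
    """Pin each dimension named in `fixed` to one pre-defined option."""
    unknown = set(fixed) - set(space)
    if unknown:
        raise KeyError(f"dimensions not in the search space: {sorted(unknown)}")
    return {dim: ([fixed[dim]] if dim in fixed else list(options))
            for dim, options in space.items()}


def size(space: Space) -> int:
    """Number of designs in the Cartesian product of all dimensions."""
    return prod(len(options) for options in space.values())
```

Pinning the hidden units, propagation function, and combination function in this example reduces the space from 288 to 24 designs, leaving only the aggregation and activation functions to be searched.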


17.3 Search Algorithms


Many different search strategies can be used to explore the search space in Au- 
toGNN, including random search, evolutionary methods, RL, and differentiable 
search methods. In this section, we will introduce the basic concepts of these search 
algorithms and how to utilize them to explore candidate designs. 


17.3.1 Random Search 


Given a search space, random search samples designs at random with equal
probability. Random search is the most basic approach, yet it is quite
effective in practice. In addition to serving as a baseline in AutoGNN works
(Zhou et al, 2019a; Gao et al, 2020b), random search is the standard benchmark
for comparing the effectiveness of different candidate options along a
dimension of the search space (You et al, 2020a). Specifically, suppose the
dimension to be evaluated is batch normalization, whose candidate options are
{False, BatchNorm}. To comprehensively compare the effectiveness of these two
options, a series of diverse designs are randomly sampled from the search
space, and in each design the batch normalization dimension is set to False
and to BatchNorm in turn. Each pair of designs (referred to as
Normalization=False and Normalization=BatchNorm) is compared in terms of
model performance on a downstream graph analysis task. It is found that the
designs with Normalization=BatchNorm generally rank higher than the others,
which indicates the benefit of including BatchNorm in the model design.
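
The paired comparison described above can be sketched as follows. The search space is a small illustrative subset, and `toy_evaluate` is a hypothetical stand-in for training a design and measuring its validation performance; its scores are made up for the example and are not experimental results.

```python
import random
from typing import Any, Callable, Dict, List

SPACE: Dict[str, List[Any]] = {
    "aggregation": ["SUM", "MEAN", "MAX"],
    "dropout": [False, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
    "learning_rate": [1e-4, 5e-4, 1e-3, 5e-3, 0.01, 0.1],
    "normalization": [False, "BatchNorm"],
}


def sample_design(space: Dict[str, List[Any]], rng: random.Random) -> Dict[str, Any]:
    """Random search: pick one option per dimension uniformly at random."""
    return {name: rng.choice(options) for name, options in space.items()}


def controlled_comparison(space: Dict[str, List[Any]], dimension: str,
                          evaluate: Callable[[Dict[str, Any]], float],
                          n_groups: int, rng: random.Random) -> Dict[Any, int]:
    """Count how often each option of `dimension` wins with all else fixed."""
    wins = {option: 0 for option in space[dimension]}
    for _ in range(n_groups):
        base = sample_design(space, rng)
        scores = {option: evaluate({**base, dimension: option})
                  for option in space[dimension]}
        wins[max(scores, key=scores.get)] += 1
    return wins


def toy_evaluate(design: Dict[str, Any]) -> float:
    """Hypothetical stand-in for 'train the design, return validation score'."""
    bonus = 0.05 if design["normalization"] == "BatchNorm" else 0.0
    return 0.7 + bonus - 0.1 * abs(design["learning_rate"] - 0.01)
```

Calling `controlled_comparison(SPACE, "normalization", toy_evaluate, 20, random.Random(0))` returns the win count of each normalization option over 20 randomly sampled groups; with a real evaluation function the counts would reflect measured rankings.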


17.3.2 Evolutionary Search 


Evolutionary methods evolve a population of designs, i.e., a set of different
GNN architectures and training hyperparameters. In every evolution step, at
least one design is sampled from the population and serves as a parent, from
which a new child design is generated by applying mutations. In the context of
AutoGNN, the design mutations are local operations, such as changing the
aggregation function from MAX to SUM, altering the hidden units, or altering a
specific training hyperparameter. After the child design is trained, its
performance is evaluated on the validation set, and superior designs are added
to the population. Specifically, Shi et al (2020) propose to select two parent
designs and cross them over along some dimensions. To generate diverse child
designs, Shi et al (2020) further mutate the resulting crossover designs.
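
A minimal sketch of such an evolutionary loop is given below. It is not the procedure of Shi et al (2020): the uniform crossover, the single-dimension mutation, and the rule that a child replaces the worst population member only when it scores higher are all simplifying assumptions, and `evaluate` stands in for training and validating a design.

```python
import random
from typing import Any, Callable, Dict, List, Tuple

Design = Dict[str, Any]
Space = Dict[str, List[Any]]


def mutate(design: Design, space: Space, rng: random.Random) -> Design:
    """Local mutation: re-sample the option of one randomly chosen dimension."""
    child = dict(design)
    dim = rng.choice(list(space))
    child[dim] = rng.choice(space[dim])
    return child


def crossover(a: Design, b: Design, rng: random.Random) -> Design:
    """Uniform crossover: inherit each dimension from either parent."""
    return {dim: rng.choice([a[dim], b[dim]]) for dim in a}


def evolve(space: Space, evaluate: Callable[[Design], float], steps: int,
           population_size: int, rng: random.Random) -> Tuple[float, Design]:
    """Evolve a population (size >= 2) and return the best (score, design)."""
    population = [{d: rng.choice(opts) for d, opts in space.items()}
                  for _ in range(population_size)]
    scored = [(evaluate(design), design) for design in population]
    for _ in range(steps):
        (_score_a, parent_a), (_score_b, parent_b) = rng.sample(scored, 2)
        child = mutate(crossover(parent_a, parent_b, rng), space, rng)
        child_score = evaluate(child)
        worst = min(range(len(scored)), key=lambda i: scored[i][0])
        if child_score > scored[worst][0]:
            scored[worst] = (child_score, child)  # keep only superior children
    return max(scored, key=lambda item: item[0])
```

Because a child only ever replaces a weaker design, the best score in the population never decreases over the evolution steps.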


17.3.3 Reinforcement Learning Based Search


RL (Silver et al, 2014; Sutton and Barto, 2018) is a learning paradigm
concerned with how agents ought to take actions in an environment to maximize
the reward. In the context of AutoGNN, the agent is the so-called
“controller”, which tries to generate promising designs. The generation of a
design can be regarded as the controller’s action. The controller’s reward is
often defined as the model performance of the generated design on the
validation set, such as the validation accuracy for the node classification
task. The controller is trained in a loop, as shown in Figure 17.3: the
controller first samples a candidate design, which is trained to convergence
to measure its performance on the target task. Note that the controller is
usually realized by an RNN, which generates the design of the GNN architecture
and training hyperparameters as a string of variable length. The controller
then uses the performance as a guiding signal to update itself toward finding
more promising designs in the subsequent search process.


Fig. 17.3: An illustration of the reinforcement learning based search
algorithm. The controller (upper block) generates a GNN architecture (lower
block) and tests it on the validation dataset. By treating the architecture as
a string of variable length, the controller usually applies an RNN to
sequentially sample options in the different dimensions (e.g., combination,
aggregation, and propagation functions) to formulate the final GNN
architecture. The validation performance is then used as feedback to train the
controller. Note that the architecture dimensions here are used only for
illustration. Please refer to Section 17.2 for a complete introduction to the
search space.


The existing RL-based AutoGNN frameworks target the sub-field problem of
GNN-NAS. Generally, in RL-based GNN-NAS, there are two sets of trainable
parameters: the parameters of the controller, denoted by $\omega$, and the
parameters of a GNN architecture, denoted by $\theta$. The training procedure
consists of two interleaving phases, which alternately solve the bi-level
optimization problem shown in Eq. (17.1). The first phase trains $\theta$ on
the training set $\mathcal{D}_{train}$ for a fixed number of epochs using
standard back-propagation. The second phase trains $\omega$ to learn to sample
high-performance GNN architectures, as evaluated on the validation set
$\mathcal{D}_{valid}$. These two phases are alternated during training.
Specifically, in the first phase, the controller proposes a GNN architecture
$f$, and gradient descent is performed on $\theta$ to minimize the loss function
$\mathcal{L}(f(\theta); \mathcal{D}_{train})$, which is computed on batches of
training data. In the second phase, the optimized parameter $\theta^*$ is fixed
and the controller parameters $\omega$ are updated to maximize the expected
reward:


$$\omega^* = \arg\max_{\omega} \; \mathbb{E}_{f \sim \pi(f;\omega)} \left[ \mathcal{R}(f(\theta^*); \mathcal{D}_{valid}) \right]. \quad (17.3)$$


Here, $\pi(f; \omega)$ is the controller’s policy, parameterized by $\omega$,
which samples and generates the GNN architecture $f$. The reward
$\mathcal{R}(f(\theta^*); \mathcal{D}_{valid})$ is the model performance
defined by the target task, such as the accuracy for the node classification
task. Furthermore, the reward is computed on the validation set, rather than
on the training set, to encourage the controller to select architectures that
generalize well. In most of the existing work, the gradient of the expected
reward
$\mathbb{E}_{f \sim \pi(f;\omega)}[\mathcal{R}(f(\theta^*); \mathcal{D}_{valid})]$
with respect to $\omega$ is computed using the REINFORCE rule (Sutton et al,
2000).
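
The REINFORCE update can be illustrated on a deliberately reduced problem: a controller that chooses a single aggregation function. In this sketch the controller is a table of softmax logits rather than an RNN, the moving-average baseline is an added variance-reduction choice not described in the text, and `toy_reward` is a hypothetical stand-in for the validation performance of the trained architecture.

```python
import math
import random
from typing import Callable, List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_search(options: Sequence[str],
                     reward_fn: Callable[[str], float],
                     steps: int, lr: float,
                     rng: random.Random) -> List[float]:
    """Train softmax logits (the controller parameters) with REINFORCE."""
    logits = [0.0] * len(options)
    baseline = 0.0  # moving average of past rewards (assumption of this sketch)
    for _ in range(steps):
        probs = softmax(logits)
        choice = rng.choices(range(len(options)), weights=probs)[0]
        reward = reward_fn(options[choice])
        advantage = reward - baseline
        baseline = 0.9 * baseline + 0.1 * reward
        for j in range(len(logits)):
            # Gradient of log pi(choice) w.r.t. logit j for a softmax policy.
            grad_log_prob = (1.0 if j == choice else 0.0) - probs[j]
            logits[j] += lr * advantage * grad_log_prob
    return softmax(logits)


def toy_reward(option: str) -> float:
    """Hypothetical reward: pretend that only SUM validates well."""
    return 1.0 if option == "SUM" else 0.0
```

Running `reinforce_search(["SUM", "MEAN", "MAX"], toy_reward, 200, 0.5, random.Random(0))` returns the final sampling probabilities, among which the one for SUM is the largest under this toy reward. A full GNN-NAS controller would apply the same update to RNN parameters across all architecture dimensions.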

Among the GNN-NAS efforts in the literature, RL-based search algorithms differ
in how they represent and train the controller. GraphNAS uses an RNN
controller to sequentially sample from the multiple architecture dimensions
and generate a string that encodes a GNN architecture (Gao et al, 2020b).
Based on the expected reward signaling the quality of the whole architecture,
the RNN controller has to optimize the sampling policies along all the
dimensions. AGNN (Zhou et al, 2019a) is motivated by the observation that a
minor modification to one architecture dimension can lead to an abrupt change
in performance. For example, the graph classification accuracy of a GNN may be
significantly improved by only changing the aggregation function from MAX to
SUM (Xu et al, 2019d). Based on this observation, AGNN proposes a more
efficient controller consisting of a series of RNN sub-controllers, each
corresponding to an independent architecture dimension. At each step, AGNN
applies only one of the RNN sub-controllers to sample new options from the
corresponding dimension, and uses these options to mutate the best
architecture found so far. By evaluating such a slightly mutated design, the
RNN sub-controller can exclude the noise generated by modifications to the
other architecture dimensions and better train the sampling policy of its own
dimension.


17.3.4 Differentiable Search


There are several candidate options along each architecture dimension. For
example, for the aggregation function at a particular layer, we have the
option of applying SUM, MEAN, or MAX pooling. The common search approaches in
GNN-NAS, such as random search, evolutionary algorithms, and RL-based search
methods, treat the selection of the best option as a black-box optimization
problem over a discrete domain. At each search step, they sample and evaluate
a single architecture from the discrete architecture search space. However,
such a search process is very time-consuming, since the number of possible
models is extremely large. Differentiable search algorithms relax the discrete
search space to be continuous, so that it can be optimized efficiently by
gradient descent. Specifically, for each architecture dimension,
differentiable search algorithms usually relax the hard choice from the
candidate set into a continuous distribution, in which each option is assigned
a probability. An example illustrating the differentiable search along the
aggregation function dimension is shown in Figure 17.4. At the $k$-th layer,
the node embedding output by the aggregation function can be decomposed and
expressed as:


$$n_i^{(k)} = \begin{cases}
\sum_m \alpha_m \, o_m\big(\{a_{ij}^{(k)} W^{(k)} h_j^{(k-1)} : j \in \mathcal{N}(i) \cup \{i\}\}\big), & \text{weighted option combination}, \\
\alpha_m \, o_m\big(\{a_{ij}^{(k)} W^{(k)} h_j^{(k-1)} : j \in \mathcal{N}(i) \cup \{i\}\}\big), \; o_m \sim p(o_m), & \text{option sampling},
\end{cases} \quad (17.4)$$


where $o_m$ represents the $m$-th aggregation function option and $\alpha_m$
is the sampling probability associated with that option. The probability
distribution along a dimension is normalized to sum to one. The architecture
distribution is then formulated as the joint probability distribution of all
the dimensions. At each search step, as shown in Eq. (17.4) (with the example
of the aggregation function dimension), the actual operation of a dimension in
a new architecture can be generated in two different ways: weighted option
combination and option sampling. In the case of weighted option combination,
the actual operation is represented by the weighted average of all candidate
options. In the other case, the actual operation is instead sampled from the
probability distribution $p(o_m)$ of the corresponding architecture dimension.
In both cases, the adopted options are scaled by their sampling probabilities
to support the optimization of the architecture distribution by gradient
descent. The architecture distribution is then updated directly by
backpropagating the training loss at each training step. During testing, the
discrete architecture is obtained by retaining, along each dimension, the
candidate with the highest probability $\alpha_m$. In contrast to black-box
optimization, gradient-based optimization is significantly more data
efficient, and hence greatly speeds up the search process.
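
The two ways of generating the operation, and the final discretization, can be sketched for the aggregation dimension as follows. This is a forward-pass illustration only: the architecture parameters are plain floats that a softmax turns into probabilities, no gradients are computed (an automatic differentiation library would be needed for the actual search), and all function names are hypothetical.

```python
import math
import random
from typing import Callable, Dict, List, Sequence, Tuple

Vector = List[float]

# Candidate aggregation options, applied element-wise over the messages.
OPTIONS: Dict[str, Callable[[Sequence[Vector]], Vector]] = {
    "SUM": lambda msgs: [sum(col) for col in zip(*msgs)],
    "MEAN": lambda msgs: [sum(col) / len(col) for col in zip(*msgs)],
    "MAX": lambda msgs: [max(col) for col in zip(*msgs)],
}


def softmax(params: Sequence[float]) -> List[float]:
    """Turn unconstrained architecture parameters into probabilities."""
    peak = max(params)
    exps = [math.exp(p - peak) for p in params]
    total = sum(exps)
    return [e / total for e in exps]


def mixed_aggregate(messages: Sequence[Vector], params: Sequence[float]) -> Vector:
    """Weighted option combination: probability-weighted sum of all options."""
    probs = softmax(params)
    outputs = [op(messages) for op in OPTIONS.values()]
    return [sum(p * out[d] for p, out in zip(probs, outputs))
            for d in range(len(outputs[0]))]


def sampled_aggregate(messages: Sequence[Vector], params: Sequence[float],
                      rng: random.Random) -> Tuple[str, Vector]:
    """Option sampling: draw one option and scale its output by its probability."""
    probs = softmax(params)
    names = list(OPTIONS)
    idx = rng.choices(range(len(names)), weights=probs)[0]
    return names[idx], [probs[idx] * v for v in OPTIONS[names[idx]](messages)]


def discretize(params: Sequence[float]) -> str:
    """After the search, keep the option with the highest probability."""
    names = list(OPTIONS)
    return names[max(range(len(params)), key=lambda m: params[m])]
```

With equal parameters the mixed output is the plain average of the SUM, MEAN, and MAX results, and `discretize([0.1, -0.3, 2.0])` selects MAX.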


Fig. 17.4: An example illustrating the differentiable search for the
aggregation function. At a search step, the aggregation function is given by
the weighted combination of the three candidates, or is instead realized by
one sampled option (e.g., MAX scaled by its probability $\alpha_m$). Once the
search process terminates, the option with the highest probability (e.g., MAX
with the solid arrow) is used in the final architecture, which is evaluated on
the test set.


Compared with RL-based search, differentiable search algorithms are less
popular in the GNN-NAS literature. PDNAS (Zhao et al, 2020g) relaxes the
discrete search space into a continuous one by employing the Gumbel-sigmoid,
enabling optimization via gradient descent. POSE focuses on searching the
propagation function, whose discrete search space is relaxed by a softmax
approximation.


17.3.5 Efficient Performance Estimation 


To solve the bi-level optimization problem of AutoGNN, all the above search
algorithms share a common two-stage working pipeline at each step: sampling a
new design, and adjusting the search algorithm based on the performance
estimation of the new design. Once the search process terminates, the design
with the highest model performance is treated as the desired solution to the
optimization problem. Therefore, an accurate performance estimation strategy
is crucial to an AutoGNN framework. The simplest way of estimating performance
is to perform standard training for each generated design and then obtain the
model performance on the held-out validation set. However, such an intuitive
strategy is computationally expensive given the long search process and
massive graph datasets.

Parameter sharing is one of the efficient strategies to reduce the cost of
performance estimation, as it avoids training each design from scratch.
Parameter sharing was first proposed in ENAS (Pham et al, 2018), which forces
all designs to share weights to improve efficiency. A new design can be
estimated immediately by reusing the previously trained weights. However, such
a strategy cannot be directly adopted in GNN-NAS, since the GNN architectures
in the search space may have weights with different dimensions or shapes. To
tackle this challenge, recent work has modified the parameter sharing strategy
to customize it for GNNs. GraphNAS (Gao et al, 2020b) categorizes and stores
the optimized weights based on their shapes, and applies the ones with the
matching shape to the new design. After parameter sharing, AGNN (Zhou et al,
2019a) further uses a few training epochs to fully adapt the transferred
weights to the new design. In the differentiable GNN-NAS frameworks, parameter
sharing is conducted naturally between GNN architectures that share common
computation options (Zhao et al, 2020g; Ding et al, 2020a).
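
Shape-based parameter sharing can be sketched as a store that is keyed by weight shape. This is only an illustration of the idea, not the GraphNAS or AGNN implementation: the class and function names are hypothetical, weights are nested Python lists, and a zero initializer stands in for a real initialization scheme.

```python
from typing import Callable, Dict, List, Tuple

Matrix = List[List[float]]
Shape = Tuple[int, int]


def shape_of(weight: Matrix) -> Shape:
    """Return (rows, columns) of a non-empty weight matrix."""
    return (len(weight), len(weight[0]))


class SharedWeights:
    """Keep the latest trained weight matrix for every shape seen so far."""

    def __init__(self) -> None:
        self._by_shape: Dict[Shape, Matrix] = {}

    def store(self, weight: Matrix) -> None:
        """Save trained weights so later designs with this shape can reuse them."""
        self._by_shape[shape_of(weight)] = weight

    def fetch(self, shape: Shape,
              init: Callable[[Shape], Matrix]) -> Tuple[Matrix, bool]:
        """Return (weights, reused): stored weights if the shape matches,
        otherwise freshly initialized ones."""
        if shape in self._by_shape:
            return self._by_shape[shape], True
        return init(shape), False


def zeros(shape: Shape) -> Matrix:
    """Placeholder initializer used when no stored weights fit."""
    rows, cols = shape
    return [[0.0] * cols for _ in range(rows)]
```

A new design whose layer shapes match earlier designs starts from their trained weights and, following AGNN, could then be fine-tuned for a few epochs; layers with unseen shapes fall back to fresh initialization.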


17.4 Future Directions 


We have reviewed various search spaces and search algorithms. Although some
initial AutoGNN efforts have been made, AutoGNN is still at a preliminary
research stage compared with the rapid development of AutoML in computer
vision. In this section, we discuss several future directions, especially for
research on GNN-NAS.


• Search space. The design of the architecture search space is the most
important part of a GNN-NAS framework. An appropriate search space should be
comprehensive, covering the key architecture dimensions and their
state-of-the-art primitive options, to guarantee the performance of the
searched architecture for any given task. Besides, the search space should be
compact, incorporating a moderate number of powerful options, to make the
search process efficient. However, most of the existing architecture search
spaces are constructed based on vanilla GCN and GAT, and fail to consider the
recent GNN developments. For example, graph pooling (Ying et al, 2018c; Gao
and Ji, 2019; Lee et al, 2019b; Zhou et al, 2020e) has attracted increasing
research interest for encoding graph structures hierarchically. Based on the
wide variety of pooling algorithms, the corresponding hierarchical GNN
architectures gradually shrink the graph size and enlarge the neighborhood
receptive field, empirically improving performance on the downstream graph
analysis tasks. Furthermore, a series of novel graph convolution mechanisms
have been proposed from different perspectives, such as neighbor-sampling
methods to accelerate computation (Hamilton et al, 2017b; Chen et al, 2018c;
Zeng et al, 2020a), and PageRank-based graph convolutions to extend the
neighborhood size (Klicpera et al, 2019a,a; Bojchevski et al, 2020b). With the
development of the GNN community, it is crucial to update the search space to
subsume the state-of-the-art models.

• Deep graph neural networks. All the existing search spaces are implemented
with shallow GNN architectures, i.e., the number of graph convolutional layers
$l_{gc} \le 10$. Unlike the deep neural networks (e.g., CNNs and transformers)
widely adopted in computer vision and natural language processing, GNN
architectures are usually limited to fewer than 3 layers (Kipf and Welling,
2017b; Veličković et al, 2018). As the number of layers increases, the node
representations converge to indistinguishable vectors due to the recursive
neighborhood aggregation and non-linear activation (Li et al, 2018b; Oono and
Suzuki, 2020). This phenomenon is known as the over-smoothing issue (NT and
Maehara, 2019), which prevents deep GNNs from modeling the dependencies on
high-order neighbors. Recently, many approaches have been proposed to relieve
the over-smoothing issue and construct deep GNNs, including embedding
normalization (Zhao and Akoglu, 2019; Zhou et al, 2020d; Ioffe and Szegedy,
2015), residual connections (Li et al, 2019c, 2018b; Chen et al, 2020l;
Klicpera et al, 2019a), and random data augmentation (Rong et al, 2020b; Feng
et al, 2020). However, most of them only achieve performance comparable to, or
even worse than, their corresponding shallow models. By incorporating these
new techniques into the search space, GNN-NAS could effectively combine them
and identify novel deep GNN models, unleashing the power of deep learning for
graph analytics.

• Applications to emerging graph analysis tasks. One limitation of the
GNN-NAS frameworks in the literature is that they are usually evaluated on a
few benchmark datasets, such as Cora, Citeseer, and Pubmed for node
classification (Yang et al, 2016b). However, graph-structured data is
ubiquitous, and novel graph analysis tasks are constantly emerging in
real-world applications, such as property prediction of biochemical molecules
(i.e., graph classification) (Zitnik and Leskovec, 2017; Aynaz Taheri, 2018;
Gilmer et al, 2017; Jiang and Balaprakash, 2020), item/friend recommendation
in social networks (i.e., link prediction) (Ying et al, 2018b; Monti et al,
2017; He et al, 2020), and circuit design (i.e., graph generation) (Wang et
al, 2020b; Li et al, 2020h; Zhang et al, 2019d). The surge of novel tasks
poses significant challenges for the future search of well-performing
architectures in GNN-NAS, due to the diverse data characteristics and
objectives of the tasks and the expensive search cost. On the one hand, since
the new tasks may not resemble any of the existing benchmarks, the search
space has to be re-constructed by considering their specific data
characteristics. For example, for a knowledge graph with informative edge
attributes, the micro-architecture search space needs to incorporate
edge-aware graph convolutional layers to guarantee the desired model
performance (Schlichtkrull et al, 2018; Shang et al, 2019). On the other hand,
if the new tasks are similar to the existing ones, the search algorithms could
re-exploit the best architectures discovered before to accelerate the search
process on the new tasks. For example, one can simply initialize the search
process with these sophisticated architectures and use several epochs to
explore potentially good ones within a small region. Especially for massive
graphs with a large volume of nodes and edges, the reuse of well-performing
architectures from similar tasks could significantly reduce the computation
cost. The research challenge is how to quantify the similarities between
different graph-structured data.


Editor’s Notes: Automated graph neural networks introduce automated machine
learning to tackle the problems of GNN neural architecture search and
hyperparameter search. Hence, this chapter is orthogonal to most of the other
chapters in this book, which generally depend on expert experience to design
specific models and tune hyperparameters. The neural architecture search space
contains the components of manually designed models, such as the kinds of
aggregators introduced in chapter 4 and chapter 5. Automated graph neural
networks support common graph analysis tasks, such as node classification
(chapter 4), graph classification (chapter 9), and link prediction
(chapter 10).


Abstract Although deep learning has achieved state-of-the-art performance across
numerous domains, these models generally require large annotated datasets to reach 
their full potential and avoid overfitting. However, obtaining such datasets can have 
high associated costs or even be impossible to procure. Self-supervised learning 
(SSL) seeks to create and utilize specific pretext tasks on unlabeled data to aid in 
alleviating this fundamental limitation of deep learning models. Although initially 
applied in the image and text domains, recent interest has been in leveraging SSL 
in the graph domain to improve the performance of graph neural networks (GNNs). 
For node-level tasks, GNNs can inherently incorporate unlabeled node data through 
the neighborhood aggregation unlike in the image or text domains; but they can 
still benefit by applying novel pretext tasks to encode richer information and nu- 
merous such methods have recently been developed. For GNNs solving graph-level 
tasks, applying SSL methods is more aligned with other traditional domains, but still 
presents unique challenges and has been the focus of a few works. In this chapter, 
we summarize recent developments in applying SSL to GNNs categorizing them 
via the different training strategies and types of data used to construct their pretext 
tasks, and finally discuss open challenges for future directions. 


18.1 Introduction


Recent years have witnessed the great success of applying deep learning in numer- 
ous fields. However, the superior performance of deep learning heavily depends 
on the quality of the supervision provided by the labeled data and collecting a 
large amount of high-quality labeled data tends to be time-intensive and resource- 
expensive (Hu et al, 2020c; Zitnik and Leskovec, 2017). Therefore, to alleviate the 
demand for massive labeled data and provide sufficient supervision, self-supervised 
learning (SSL) has been introduced. Specifically, SSL designs domain-specific pre- 
text tasks that leverage extra supervision from unlabeled data to train deep learning 
models and learn better representations for downstream tasks. In computer vision, 
various pretext tasks have been studied, e.g., predicting relative locations of image 
patches (Noroozi and Favaro, 2016) and identifying augmented images generated 
from image processing techniques such as cropping, rotating and resizing (Shorten 
and Khoshgoftaar, 2019). In natural language processing, self-supervised learning 
has also been heavily utilized, e.g., predicting the masked word in BERT (Devlin 
et al, 2019). 

Simultaneously, graph representation learning has emerged as a powerful
strategy for analyzing graph-structured data over the past few years
(Hamilton, 2020). As the generalization of deep learning to the graph domain,
Graph Neural Networks (GNNs) have become a promising paradigm due to their
efficiency and strong performance in real-world applications (You et al, 2021;
Zitnik and Leskovec, 2017). However, the vanilla GNN model (i.e., Graph
Convolutional Network (Kipf and Welling, 2017b)) and even more advanced
existing GNNs (Hamilton et al, 2017b; Xu et al, 2019d, 2018a) are mostly
trained in a semi-supervised or supervised manner, which still requires
high-cost label annotation. Additionally, these GNN models may not take full
advantage of the abundant information in unlabeled data, such as the graph
topology and node attributes. Hence, SSL can be naturally harnessed for GNNs
to gain additional supervision and thoroughly exploit the information in the
unlabeled data.

Compared with grid-based data such as images or text (Zhang et al, 2020e),
graph-structured data is far more complex due to its highly irregular
topology, involved intrinsic interactions, and abundant domain-specific
semantics (Wu et al, 2021d). Unlike images and text, where the entire
structure represents a single entity or expresses a single semantic meaning,
each node in a graph is an individual instance with its own features,
positioned in its own local context. Furthermore, these individual instances
are inherently related to each other, forming diverse local structures that
encode even more complex information to be discovered and analyzed. While
such complexity engenders tremendous challenges in analyzing graph-structured
data, the substantial and diverse information contained in the node features,
node labels, local/global graph structures, and their interactions and
combinations provides golden opportunities to design self-supervised pretext
tasks.

Embracing the challenges and opportunities to study self-supervised learning
in GNNs, the works (Hu et al, 2020c, 2019c; Jin et al, 2020d; You et al, 2020c)
were the first to systematically design and compare different self-supervised
pretext tasks in GNNs. For example, the works (Hu et al, 2019c; You
et al, 2020c) design pretext tasks to encode the topological properties of a node 
such as centrality, clustering coefficient, and its graph partitioning assignment, or 
to encode the attributes of a node such as individual features and clustering assign- 
ments in embeddings output by GNNs. The work (Jin et al, 2020d) designs pretext 
tasks to align the pairwise feature similarity or the topological distance between 
two nodes in the graph with the closeness of two nodes in the embedding space. 
Apart from the supervision information employed in creating pretext tasks, designing
effective training strategies and selecting reasonable loss functions are other
crucial components in incorporating SSL into GNNs. Two frequently used training
strategies that equip GNNs with SSL are 1) pre-training GNNs through complet- 
ing pretext task(s) and then fine-tuning the GNNs on downstream task(s), and 2) 
jointly training GNNs on both pretext and downstream tasks (Jin et al, 2020d; You 
et al, 2020c). There are also a few works (Chen et al, 2020c; Sun et al, 2020c) that
apply the idea of self-training to incorporate SSL into GNNs. In addition, loss
functions are tailored to the specific pretext task: cross-entropy loss for
classification-based tasks, mean squared error loss for regression-based tasks, and
contrastive loss for contrastive-based tasks.

In view of the substantial progress made in the field of graph neural networks 
and the significant potential of self-supervised learning, this chapter aims to present 
a systematic and comprehensive review of applying self-supervised learning to
graph neural networks. The rest of the chapter is organized as follows. Section 18.2 
first introduces self-supervised learning and pretext tasks, and then summarizes fre- 
quently used self-supervised methods from the image and text domains. In Sec- 
tion 18.3, we introduce the training strategies that are used to incorporate SSL 
into GNNs and categorize the pretext tasks that have been developed for GNNs. 
Sections 18.4 and 18.5 present detailed summaries of numerous representative SSL
methods that have been developed for node-level and graph-level pretext tasks. 
Thereafter, in Section 18.6 we discuss representative SSL methods that are devel- 
oped using both node-level and graph-level supervision, which we refer to as node- 
graph-level pretext tasks. Section 18.7 collects and reinforces the major results and 
the insightful discoveries in prior sections. Concluding remarks and future forecasts 
on the development of SSL in GNNs are provided in Section 18.8. 


18.2 Self-supervised Learning 


Supervised learning is the machine learning task of training a model that maps an
input to an output based on the ground-truth input-output pairs provided by a labeled
dataset. Good performance of supervised learning requires a decent amount
of labeled data (especially when using deep learning models), which is expensive
to collect manually. Conversely, self-supervised learning generates supervisory
signals from unlabeled data and then trains the model based on the generated
supervisory signals. The task used for training the model based on the generated signal is
referred to as the pretext task. In comparison, the task whose ultimate performance
we care about the most and expect our model to solve is referred to as the downstream
task. For self-supervised learning to yield performance benefits,
pretext tasks should be carefully designed such that completing them encourages
the model to develop an understanding similar or complementary to the one needed for
completing downstream tasks. Self-supervised learning originated to solve tasks in the image and
text domains. The following part introduces self-supervised learning in
these two fields with specific emphasis on different pretext tasks.

In computer vision (CV), many ideas have been proposed for self-supervised representation
learning on image data. A common example is that we expect that a small
distortion of an image does not affect its original semantic meaning or geometric
forms. The idea of creating surrogate training datasets from unlabeled image patches,
by first sampling patches from different images at varying positions and then distorting
the patches with a variety of random transformations, is proposed in (Dosovitskiy
et al, 2014). The pretext task is to discriminate between patches distorted
from the same image or from different images. Rotation of an entire image is another
effective and inexpensive way to modify an input image without changing
semantic content (Gidaris et al, 2018). Each input image is first rotated by a mul- 
tiple of 90 degrees at random. The model is then trained to predict which rotation 
has been applied. However, instead of performing pretext tasks on an entire image, 
the local patches could also be extracted to construct the pretext tasks. Examples of 
methods using this technique include predicting the relative position between two 
random patches from one image (Doersch et al, 2015) and designing a jigsaw puz- 
zle game to place nine shuffled patches back to the original locations (Noroozi and 
Favaro, 2016). More pretext tasks such as colorization, autoencoder, and contrastive 
predictive coding have also been introduced and effectively utilized (Oord et al, 
2018; Vincent et al, 2008; Zhang et al, 2016d). 

While computer vision has achieved amazing progress on self-supervised learn- 
ing in recent years, self-supervised learning has been heavily utilized in natural lan- 
guage processing (NLP) research for quite a while. Word2vec (Mikolov et al, 2013b) 
is the first work that popularized the SSL ideas in the NLP field. Center word pre- 
diction and neighbor word prediction are two pretext tasks in Word2vec where the 
model is given a small chunk of the text and asked to predict the center word in that 
text or vice versa. BERT (Devlin et al, 2019) is another famous pre-trained model 
in NLP whose two pretext tasks are to recover randomly masked words in a text and
to classify whether two sentences can come one after another or not. Similar works 
have also been introduced, such as having the pretext task classify whether a pair 
of sentences are in the correct order (Lan et al, 2020), or a pretext task that first 
randomly shuffles the ordering of sentences and then seeks to recover the original 
ordering (Lewis et al, 2020). 

Compared with the difficulty of data acquisition encountered in image and text 
domains, machine learning in the graph domain faces even more challenges in ac- 
quiring high-quality labeled data. For example, for molecular graphs it can be ex- 
tremely expensive to perform the necessary laboratory experiments to label some 
molecules (Rong et al, 2020a), and in a social network obtaining ground-truth labels
for individual users may require large-scale surveys, or the labels may not be releasable due
to privacy agreements/concerns (Chen et al, 2020a). Therefore, the success achieved
by applying SSL in CV and NLP naturally leads to the question of whether SSL
can be effectively applied in the graph domain. Given that graph neural networks are
among the most powerful paradigms for graph representation learning, in the following
sections we will mainly focus on introducing self-supervised learning within
the framework of graph neural networks and highlighting/summarizing these recent
advancements.


18.3 Applying SSL to Graph Neural Networks: Categorizing 
Training Strategies, Loss Functions and Pretext Tasks 


When seeking to apply self-supervised learning to GNNs, the major decisions to 
be made are how to construct the pretext tasks, which includes what information to 
leverage from the unlabeled data, what loss function to use, and what training strat- 
egy to use for effectively improving the GNN’s performance. Hence, in this section 
we will first mathematically formalize the graph neural network with self-supervised 
learning and then discuss each of the above. More specifically, we will introduce 
three training strategies, three loss functions that are frequently employed in the cur- 
rent literature, and categorize current state-of-the-art pretext tasks for GNNs based 
on the type of information they leverage for constructing the pretext task. 

Given an undirected attributed graph $\mathcal{G} = \{\mathcal{V}, \mathcal{E}, X\}$, where
$\mathcal{V} = \{v_1, \dots, v_{|\mathcal{V}|}\}$ represents the vertex set with $|\mathcal{V}|$
vertices, $\mathcal{E}$ represents the edge set and $e_{ij} = (v_i, v_j)$ is an edge between
nodes $v_i$ and $v_j$, $X \in \mathbb{R}^{|\mathcal{V}| \times d}$ represents the feature matrix and
$x_i = X[i,:]^\top \in \mathbb{R}^{d}$ is the $d$-dimensional feature vector of the node $v_i$.
$A \in \mathbb{R}^{|\mathcal{V}| \times |\mathcal{V}|}$ is the adjacency matrix where $A_{ij} = 1$ if
$e_{ij} \in \mathcal{E}$ and $A_{ij} = 0$ if $e_{ij} \notin \mathcal{E}$. We denote any GNN-based
feature extractor as $f_\theta : \mathbb{R}^{|\mathcal{V}| \times d} \times \mathbb{R}^{|\mathcal{V}| \times |\mathcal{V}|} \to \mathbb{R}^{|\mathcal{V}| \times d'}$
parametrized by $\theta$, which takes any node feature matrix $X$ and the graph adjacency matrix $A$
and outputs the $d'$-dimensional representation for each node
$Z_{GNN} = f_\theta(X, A) \in \mathbb{R}^{|\mathcal{V}| \times d'}$, which is further fed into any
permutation invariant function $\mathrm{READOUT} : \mathbb{R}^{|\mathcal{V}| \times d'} \to \mathbb{R}^{d'}$
to obtain the graph embeddings
$z_{GNN,\mathcal{G}} = \mathrm{READOUT}(f_\theta(X, A)) \in \mathbb{R}^{d'}$. More specifically, we
note that here $\theta$ represents the parameters encoded in
the corresponding network architectures of the GNN (Hamilton et al, 2017b; Kipf
and Welling, 2017b; Petar et al, 2018; Xu et al, 2019d, 2018a). Considering the
transductive semi-supervised tasks where we are provided with the labeled node
set $\mathcal{V}_l \subset \mathcal{V}$, the labeled graph $\mathcal{G}$, the associated node label
matrix $Y_{sup} \in \mathbb{R}^{|\mathcal{V}_l| \times l}$ and the graph label
$y_{sup,\mathcal{G}} \in \mathbb{R}^{l}$ with label dimension $l$, we aim to classify nodes and
graphs. The node and graph representations output by GNNs are first processed by
the extra adaptation layer $h_{\theta_{sup}}$ parametrized by the supervised adaptation parameter
$\theta_{sup}$ to obtain the predicted $l$-dimensional node label
$Z_{sup} \in \mathbb{R}^{|\mathcal{V}| \times l}$ and graph label
$z_{sup,\mathcal{G}} \in \mathbb{R}^{l}$ by Eqs. (18.1)-(18.2). Then the model parameters $\theta$ in the
GNN-based extractor $f_\theta$ and the parameters $\theta_{sup}$ in the adaptation layer
$h_{\theta_{sup}}$ are learned
by optimizing the supervised loss calculated between the output/predicted label and
the true label for labeled nodes and the labeled graph, which can be formulated as:

$$Z_{sup} = h_{\theta_{sup}}(f_\theta(X, A)) \tag{18.1}$$

$$z_{sup,\mathcal{G}} = h_{\theta_{sup}}(\mathrm{READOUT}(f_\theta(X, A))) \tag{18.2}$$

$$\theta^*, \theta^*_{sup} = \arg\min_{\theta, \theta_{sup}} \mathcal{L}_{sup}(\theta, \theta_{sup}) =
\begin{cases}
\arg\min_{\theta, \theta_{sup}} \dfrac{1}{|\mathcal{V}_l|} \sum_{v_i \in \mathcal{V}_l} \ell_{sup}(z_{sup,i}, y_{sup,i}) & \text{(node supervised task)} \\
\arg\min_{\theta, \theta_{sup}} \ell_{sup}(z_{sup,\mathcal{G}}, y_{sup,\mathcal{G}}) & \text{(graph supervised task)}
\end{cases} \tag{18.3}$$

where $\mathcal{L}_{sup}$ is the total supervised loss function and $\ell_{sup}$ is the supervised loss
function for each example, $y_{sup,i} = Y_{sup}[i,:]^\top$ indicates the true label for node $v_i$
in the node supervised task and $y_{sup,\mathcal{G}}$ indicates the true label for graph $\mathcal{G}$ in
the graph supervised task. Their corresponding predicted label distributions are denoted as
$z_{sup,i} = Z_{sup}[i,:]^\top$ and $z_{sup,\mathcal{G}}$. $\theta$, $\theta_{sup}$ are the parameters to
be optimized for any GNN
model and the extra adaptation layer for the supervised downstream task, respectively.
Note that for ease of notation, we assume the above graph supervised task
is operated only on one graph but the above framework can be easily adapted to
supervised tasks on multiple graphs.
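
To make the notation concrete, the following is a minimal sketch of the pipeline $X, A \to Z_{GNN} \to z_{GNN,\mathcal{G}}$ and the predictions of Eqs. (18.1)-(18.2). It assumes one particular choice for each component, none of which is prescribed by the framework above: a single GCN-style layer for $f_\theta$, mean pooling for READOUT, and a linear map for the adaptation layer. The helper names and the tiny path graph are hypothetical.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def normalize_adjacency(A):
    """Return D^{-1/2} (A + I) D^{-1/2}, a GCN-style propagation matrix."""
    n = len(A)
    A_hat = [[A[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in A_hat]
    return [[A_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]

def f_theta(X, A, theta):
    """One GCN-style layer (assumed choice): Z_GNN = ReLU(A_norm X theta)."""
    H = matmul(matmul(normalize_adjacency(A), X), theta)
    return [[max(0.0, h) for h in row] for row in H]

def readout(Z):
    """Permutation-invariant mean pooling: |V| x d' -> d'."""
    return [sum(col) / len(Z) for col in zip(*Z)]

def h_sup(Z, theta_sup):
    """Linear adaptation layer (assumed choice) mapping d' to l dimensions."""
    return matmul(Z, theta_sup)

# Hypothetical path graph v1 - v2 - v3 with d = 2 input features.
A = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
theta = [[0.5, -0.2, 0.1], [0.3, 0.4, -0.6]]      # d x d' with d' = 3
theta_sup = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # d' x l with l = 2

Z_gnn = f_theta(X, A, theta)                  # node embeddings, |V| x d'
z_graph = readout(Z_gnn)                      # graph embedding, d'
Z_sup = h_sup(Z_gnn, theta_sup)               # node predictions, Eq. (18.1)
z_sup_graph = h_sup([z_graph], theta_sup)[0]  # graph prediction, Eq. (18.2)
```

Because the readout averages over nodes, relabeling the nodes leaves the graph embedding unchanged, which is the permutation invariance required of READOUT.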


18.3.1 Training Strategies 


In this chapter, we view SSL as the process of designing a specific pretext task and 
learning the model on the pretext task. In this sense, SSL can either be used as 
unsupervised pre-training or be integrated with semi-supervised learning. 

The model capability of extracting features for completing pretext and downstream
tasks is improved through optimizing the model parameters $\theta$, $\theta_{ssl}$, and $\theta_{sup}$,
where $\theta_{ssl}$ denotes the parameters of the adaptation layer for the pretext task. Inspired
by relevant discussions (Hu et al, 2019c; Jin et al, 2020d; Sun et al, 2020c;
You et al, 2020b,c), we summarize three possible training strategies that are popular
in the literature to train GNNs in the self-supervised setting as self-training,
pre-training with fine-tuning, and joint training.


18.3.1.1 Self-training


Self-training is a strategy that leverages supervision information generated by the
model itself during the training process (Li et al, 2018b; Riloff, 1996). A typical
self-training pipeline begins by training the model over the labeled data, then
generates pseudo labels for unlabeled samples that have highly confident predictions,
and includes them in the labeled data in the next round of training. In this
way, the pretext task is the same as the downstream task by utilizing the pseudo
labels for some of the originally unlabeled data. A detailed overview is presented in
Fig. 18.1 where the prediction results are re-utilized to augment the training data in
the next iteration as done in (Sun et al, 2020c).


[Figure 18.1: the input graph is passed through a GNN-based feature extractor and a
task-specific adaptation layer (forward prediction); the loss is back-propagated, and
the top confident predictions are used to augment the training data.]

Fig. 18.1: An overview of GNNs with SSL using self-training.
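
The pseudo-labeling step of the pipeline above can be sketched as follows. This is an illustration only: the fixed confidence threshold, the dictionary-based data layout and the function names are assumptions, and the cited works may select confident predictions differently (for example, by taking the top-k most confident nodes per class).

```python
def select_pseudo_labels(probs, labeled, threshold=0.9):
    """Return {node: pseudo_label} for unlabeled nodes whose top class
    probability is at least `threshold` (an assumed selection rule).

    probs   -- dict: node -> list of class probabilities predicted by the model
    labeled -- dict: node -> known label (these nodes are skipped)
    """
    pseudo = {}
    for node, p in probs.items():
        if node in labeled:
            continue
        confidence = max(p)
        if confidence >= threshold:
            pseudo[node] = p.index(confidence)
    return pseudo

def self_training_round(probs, labeled, threshold=0.9):
    """Augment the labeled set with confident pseudo labels for the next round."""
    augmented = dict(labeled)
    augmented.update(select_pseudo_labels(probs, labeled, threshold))
    return augmented

# Hypothetical predictions for four nodes; node 0 is already labeled.
labeled = {0: 1}
probs = {0: [0.2, 0.8], 1: [0.95, 0.05], 2: [0.55, 0.45], 3: [0.03, 0.97]}
next_labeled = self_training_round(probs, labeled)
# Nodes 1 and 3 are confident enough to receive pseudo labels; node 2 is not.
```

In a full pipeline the model would be retrained on `next_labeled` and the procedure repeated.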


18.3.1.2 Pre-training and Fine-tuning 


A common strategy to utilize features learned from completing pretext tasks in- 
cludes applying the optimized parameters from self-supervision as initialization for 
fine-tuning in downstream tasks. This strategy consists of two stages: pre-training 
on the self-supervised pretext tasks and fine-tuning on the downstream tasks. The 
overview of this two-stage optimization strategy is given in Fig. 18.2. 

The whole model consists of one shared GNN-based feature extractor and two 
adaptation modules, one for the pretext task and one for the downstream task. In the 
pre-training process, the model is trained with the self-supervised pretext task(s) as: 


$$Z_{ssl} = h_{\theta_{ssl}}(f_\theta(X, A)), \tag{18.4}$$

$$z_{ssl,\mathcal{G}} = h_{\theta_{ssl}}(\mathrm{READOUT}(f_\theta(X, A))), \tag{18.5}$$

$$\theta^*, \theta^*_{ssl} = \arg\min_{\theta, \theta_{ssl}} \mathcal{L}_{ssl}(\theta, \theta_{ssl}) =
\begin{cases}
\arg\min_{\theta, \theta_{ssl}} \dfrac{1}{|\mathcal{V}|} \sum_{v_i \in \mathcal{V}} \ell_{ssl}(z_{ssl,i}, y_{ssl,i}) & \text{(node pretext tasks)} \\
\arg\min_{\theta, \theta_{ssl}} \ell_{ssl}(z_{ssl,\mathcal{G}}, y_{ssl,\mathcal{G}}) & \text{(graph pretext tasks)}
\end{cases} \tag{18.6}$$

where $\theta_{ssl}$ denotes the parameters of the adaptation layer $h_{\theta_{ssl}}$ for the pretext
tasks, $\ell_{ssl}$ is the self-supervised loss function for each example, and $\mathcal{L}_{ssl}$ is the
total loss function of completing the self-supervised task. In node pretext tasks,
$z_{ssl,i} = Z_{ssl}[i,:]^\top$ and $y_{ssl,i} = Y_{ssl}[i,:]^\top$, which are the self-supervised predicted and
true label(s) for the node $v_i$, respectively. In graph pretext tasks, $z_{ssl,\mathcal{G}}$ and
$y_{ssl,\mathcal{G}}$ are
the self-supervised predicted and true label(s) for the graph $\mathcal{G}$, respectively. Then, in
the fine-tuning process, the feature extractor $f_\theta$ is trained by completing downstream
tasks in Eqs. (18.1)-(18.3) with the pre-trained $\theta^*$ as the initialization.
Note that to utilize the pre-trained node/graph representations the fine-tuning process
can also be replaced by training a linear classifier (e.g., Logistic Regression
(Peng et al, 2020; Veličković et al, 2019; You et al, 2020b; Zhu et al, 2020c)).


[Figure 18.2: diagram of the pre-training stage, with back-propagation through the
pretext adaptation parameters $\theta_{ssl}$, followed by the fine-tuning stage.]

Fig. 18.2: An overview of GNNs with SSL using pre-training and fine-tuning.

18.3.1.3 Joint Training 


Another natural idea to harness self-supervised learning for graph neural networks is 
to combine losses of completing pretext task(s) and downstream task(s) and jointly 
train the model. The overview of the joint training is shown in Fig. 18.3. 

The joint training consists of two components: feature extraction by a GNN and
adaptation processes for both the pretext tasks and downstream tasks. In the feature
extraction process, a GNN takes the graph adjacency matrix $A$ and the feature matrix
$X$ as input and outputs the node embeddings $Z_{GNN}$ and/or graph embeddings
$z_{GNN,\mathcal{G}}$. In the adaptation procedure, the extracted node and graph embeddings are
further transformed to complete pretext and downstream tasks via $h_{\theta_{ssl}}$ and $h_{\theta_{sup}}$,
respectively. We then jointly optimize the pretext and downstream task losses as:

$$Z_{sup} = h_{\theta_{sup}}(f_\theta(X, A)), \quad Z_{ssl} = h_{\theta_{ssl}}(f_\theta(X, A)), \tag{18.7}$$

$$z_{sup,\mathcal{G}} = h_{\theta_{sup}}(\mathrm{READOUT}(f_\theta(X, A))), \quad z_{ssl,\mathcal{G}} = h_{\theta_{ssl}}(\mathrm{READOUT}(f_\theta(X, A))), \tag{18.8}$$

$$\theta^*, \theta^*_{sup}, \theta^*_{ssl} =
\begin{cases}
\arg\min_{\theta, \theta_{sup}, \theta_{ssl}} \dfrac{1}{|\mathcal{V}_l|} \sum_{v_i \in \mathcal{V}_l} \big( \alpha_1 \ell_{sup}(z_{sup,i}, y_{sup,i}) + \alpha_2 \ell_{ssl}(z_{ssl,i}, y_{ssl,i}) \big) & \text{(node pretext tasks)} \\
\arg\min_{\theta, \theta_{sup}, \theta_{ssl}} \alpha_1 \ell_{sup}(z_{sup,\mathcal{G}}, y_{sup,\mathcal{G}}) + \alpha_2 \ell_{ssl}(z_{ssl,\mathcal{G}}, y_{ssl,\mathcal{G}}) & \text{(graph pretext tasks)}
\end{cases} \tag{18.9}$$

where $\alpha_1, \alpha_2 \in \mathbb{R}_{>0}$ are the weights for combining the supervised loss $\ell_{sup}$ and the
self-supervised loss $\ell_{ssl}$.


[Figure 18.3: a shared GNN-based feature extractor feeds two task-specific adaptation
layers, one for the pretext tasks ($\theta_{ssl}$) and one for the downstream tasks
($\theta_{sup}$); the losses of both branches are back-propagated.]

Fig. 18.3: An overview of GNNs with SSL using joint training.
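
The difference between pre-training with fine-tuning (Section 18.3.1.2) and joint training can be illustrated on a deliberately tiny toy problem. Everything in this sketch is an assumption made for illustration: the shared parameter $\theta$ is a single scalar, the pretext and downstream losses are stand-in quadratics, and plain gradient descent replaces a real optimizer. It shows only the mechanics of the two schedules, not the behavior of an actual GNN.

```python
def grad_descent(grad, theta, lr=0.1, steps=200):
    """Plain gradient descent on a scalar parameter."""
    for _ in range(steps):
        theta -= lr * grad(theta)
    return theta

# Toy stand-ins (assumptions): quadratic losses sharing one scalar parameter.
def grad_ssl(t):
    """Gradient of the pretext loss L_ssl(theta) = (theta - 2)^2."""
    return 2.0 * (t - 2.0)

def grad_sup(t):
    """Gradient of the downstream loss L_sup(theta) = (theta - 3)^2."""
    return 2.0 * (t - 3.0)

# Strategy 1: pre-train on the pretext loss, then fine-tune from theta*.
theta_pre = grad_descent(grad_ssl, theta=0.0)
theta_ft = grad_descent(grad_sup, theta=theta_pre, steps=5)

# Strategy 2: joint training on alpha1 * L_sup + alpha2 * L_ssl.
alpha1, alpha2 = 1.0, 0.5

def grad_joint(t):
    return alpha1 * grad_sup(t) + alpha2 * grad_ssl(t)

theta_joint = grad_descent(grad_joint, theta=0.0)
```

In the two-stage schedule the pretext task only supplies the initialization, so its influence fades as fine-tuning proceeds; in the joint schedule the weights $\alpha_1, \alpha_2$ fix a permanent trade-off between the two losses.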


18.3.2 Loss Functions 


A loss function measures how well the model fits the data.
Generally, in GNNs with self-supervised learning, the loss function for
the pretext task takes one of three forms: classification loss, regression loss, or
contrastive learning loss. Note that the loss functions we discuss here are only for
the pretext tasks rather than downstream tasks.


18.3.2.1 Classification and Regression Loss


In completing classification-based pretext tasks such as node clustering, where node
embeddings are expected to encode the assignment information of the clusters, the
objective for the pretext task is to minimize the following loss function:

$$\mathcal{L}_{ssl} =
\begin{cases}
\dfrac{1}{|\mathcal{V}|} \sum_{v_i \in \mathcal{V}} \ell_{CE}(z_{ssl,i}, y_{ssl,i}) = -\dfrac{1}{|\mathcal{V}|} \sum_{v_i \in \mathcal{V}} \sum_{j} \mathbb{1}(y_{ssl,ij} = 1) \log(\tilde{z}_{ssl,ij}) & \text{(node pretext tasks)} \\
\ell_{CE}(z_{ssl,\mathcal{G}}, y_{ssl,\mathcal{G}}) = -\sum_{j} \mathbb{1}(y_{ssl,\mathcal{G}j} = 1) \log(\tilde{z}_{ssl,\mathcal{G}j}) & \text{(graph pretext tasks)}
\end{cases} \tag{18.10}$$

where $\ell_{CE}$ indicates the cross entropy function, $z_{ssl,i}$ and $z_{ssl,\mathcal{G}}$ represent the
predicted label distribution of node $v_i$ and graph $\mathcal{G}$ for the pretext task, and their
corresponding class probability distributions $\tilde{z}_{ssl,i}$ and $\tilde{z}_{ssl,\mathcal{G}}$ are calculated by softmax
normalization, respectively. For example, $\tilde{z}_{ssl,ij}$ is the probability of node $v_i$ belonging
to class $j$. Since every node $v_i$ has its own pseudo label (i.e., $y_{ssl,i}$) in completing
pretext tasks, we can consider all the nodes $\mathcal{V}$ in the graph compared to only the
labeled set of nodes $\mathcal{V}_l$ as before in downstream tasks.

In completing regression-based pretext tasks, such as feature completion, the
mean squared error loss is typically used as the loss function:

$$\mathcal{L}_{ssl} =
\begin{cases}
\dfrac{1}{|\mathcal{V}|} \sum_{v_i \in \mathcal{V}} \ell_{MSE}(z_{ssl,i}, y_{ssl,i}) = \dfrac{1}{|\mathcal{V}|} \sum_{v_i \in \mathcal{V}} \| z_{ssl,i} - y_{ssl,i} \|^2 & \text{(node pretext tasks)} \\
\ell_{MSE}(z_{ssl,\mathcal{G}}, y_{ssl,\mathcal{G}}) = \| z_{ssl,\mathcal{G}} - y_{ssl,\mathcal{G}} \|^2 & \text{(graph pretext tasks)}
\end{cases} \tag{18.11}$$

where the objective is minimizing the distance from our learned embedding to $y_{ssl,i}$,
which represents any ground-truth value of node $v_i$, such as the original attribute in
the feature completion or other values of node $v_i$.
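
The node-level forms of Eqs. (18.10) and (18.11) translate directly into code. The sketch below uses plain Python lists; the function names and the small example inputs are hypothetical.

```python
import math

def softmax(z):
    """Softmax normalization (shifted by the maximum for numerical stability)."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(z, y_onehot):
    """Per-example cross entropy between softmax(z) and a one-hot pseudo label."""
    p = softmax(z)
    return -sum(math.log(p_j) for p_j, y_j in zip(p, y_onehot) if y_j == 1)

def squared_error(z, y):
    """Per-example squared Euclidean distance ||z - y||^2, as in Eq. (18.11)."""
    return sum((a - b) ** 2 for a, b in zip(z, y))

def node_pretext_loss(Z, Y, per_example):
    """Average a per-example loss over all nodes (the 1/|V| sum)."""
    return sum(per_example(z, y) for z, y in zip(Z, Y)) / len(Z)

# Two nodes, two pretext classes (hypothetical values).
Z_ssl = [[0.0, 0.0], [2.0, 0.0]]
Y_ssl = [[1, 0], [0, 1]]
ce_loss = node_pretext_loss(Z_ssl, Y_ssl, cross_entropy)

# Regression target for a single node.
mse_value = squared_error([1.0, 2.0], [0.0, 0.0])  # 1 + 4 = 5.0
```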


18.3.2.2 Contrastive Learning Loss 


Inspired by the significant progress achieved by employing the contrastive learning 
in natural language processing and computer vision (Le-Khac et al, 2020), recent 
studies (Hassani and Khasahmadi, 2020; Veličković et al, 2019; You et al, 2020b; 
Zhu et al, 2020c, 2021) propose similar contrastive frameworks to enable SSL in 
GNNs. The general goal of contrastive learning in GNNs is to train GNN-based encoders
such that the agreement of representations between similar graph instances
(e.g., multiple views generated from the same instance) is maximized while the
agreement between dissimilar graph instances (e.g., multiple views generated from
different instances) is minimized. Such maximization and minimization of agreements
between different views of instances is typically formalized as maximizing
the mutual information $\mathcal{I}(Z^1_{ssl}, Z^2_{ssl})$ between representations $Z^1_{ssl}$ and
$Z^2_{ssl}$ under two different views as:

$$\max_{\theta, \theta_{ssl}} \mathcal{I}(Z^1_{ssl}, Z^2_{ssl}), \tag{18.12}$$

where $Z^1_{ssl}, Z^2_{ssl}$ correspond to representations output from any GNN-based encoder
followed by an adaptation layer $h_{\theta_{ssl}}$ under two different graph views
$\mathcal{G}^1, \mathcal{G}^2$.

Fig. 18.4: An overview of GNNs with SSL using contrastive learning.
In order to computationally estimate and maximize the mutual information, which
is intractable to compute exactly in most cases (Belghazi et al, 2018;
Gabrié et al, 2019; Paninski, 2003; Xie et al, 2021), multiple estimators of
lower bounds on the mutual information have been derived, including the normalized
temperature-scaled cross-entropy (NT-Xent) (Chen et al, 2020l), the Donsker-Varadhan
representation of the KL-divergence (Donsker and Varadhan, 1976),
noise-contrastive estimation (InfoNCE) (Gutmann and Hyvärinen, 2010), and the Jensen-Shannon
estimator (Nowozin et al, 2016). For simplicity, here we only present one frequently
used mutual information estimator, NT-Xent, which is formalized as:


$$\mathcal{L}_{ssl} = \frac{1}{|\mathcal{P}^+|} \sum_{(i,j) \in \mathcal{P}^+} \ell_{NT\text{-}Xent}(Z^1_{ssl}, Z^2_{ssl}, \mathcal{P}^-)
= -\frac{1}{|\mathcal{P}^+|} \sum_{(i,j) \in \mathcal{P}^+} \log \frac{\exp(\mathcal{D}(z^1_{ssl,i}, z^2_{ssl,j}))}{\sum_{k \in \{j \cup \mathcal{P}^-_i\}} \exp(\mathcal{D}(z^1_{ssl,i}, z^2_{ssl,k}))}, \tag{18.13}$$

where $\mathcal{D}(z^1_{ssl,i}, z^2_{ssl,j}) = \frac{\mathrm{sim}(z^1_{ssl,i}, z^2_{ssl,j})}{\tau}$ is a learnable
discriminator parametrized with
the similarity function (i.e., cosine similarity) and the temperature factor $\tau$,
$\mathcal{P}^+$ represents the set of all pairs of positive samples while
$\mathcal{P}^- = \cup_{(i,j) \in \mathcal{P}^+} \mathcal{P}^-_i$ represents
all sets of negative samples. In particular, $\mathcal{P}^-_i$ contains all negative samples of
the sample $i$. Note that we can contrast node representations, graph representations
and node-graph representations under different views. Therefore, $Z^1_{ssl}$ is not
limited to the node embeddings, but could refer to the embeddings of both node and
graph under the first graph view $\mathcal{G}^1$. Thus, $i, j, k$ could refer to both node and graph
samples.
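
As a minimal sketch of Eq. (18.13), the code below computes the NT-Xent loss for two views given as lists of embedding vectors. It makes one simplifying assumption that is not part of the general formulation: the positive of row i in the first view is row i in the second view, and the negatives of sample i are all other rows of the second view. The function names and the default temperature are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nt_xent(Z1, Z2, tau=0.5):
    """Average NT-Xent loss over the positive pairs (i, i).

    Assumption: negatives of sample i are the rows k != i of the second view,
    so the denominator of Eq. (18.13) runs over every row of Z2.
    """
    n = len(Z1)
    total = 0.0
    for i in range(n):
        logits = [cosine(Z1[i], Z2[k]) / tau for k in range(n)]
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += -(logits[i] - log_denominator)
    return total / n

# Hypothetical embeddings: the second view either matches or swaps the nodes.
Z1 = [[1.0, 0.0], [0.0, 1.0]]
loss_aligned = nt_xent(Z1, [[1.0, 0.0], [0.0, 1.0]])
loss_swapped = nt_xent(Z1, [[0.0, 1.0], [1.0, 0.0]])
```

The loss is small when each sample is most similar to its own positive and grows when a negative is more similar than the positive, which is the behavior the contrastive objective rewards and penalizes, respectively.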

Fig. 18.5: A categorization of pretext tasks in self-supervised learning.¹


18.3.3 Pretext Tasks 


Pretext tasks are constructed by leveraging different types of supervision information
coming from different components of graphs. Based on the components that
generate the supervision information, pretext tasks that are prevalent in the literature
are categorized into node-level, graph-level and node-graph level. In completing
node-level and graph-level pretext tasks, three types of information can be leveraged:
graph structure, node features, or hybrid, where the latter combines the information
from node features, graph structure, and even information from the known
training labels (as presented in (Jin et al, 2020d)). We summarize the categorization
of pretext tasks as a tree where each leaf node represents a specific type of pretext
tasks in Fig. 18.5 while also including the corresponding references. In the next
three sections, we give detailed explanations about each of these pretext tasks and
summarize the majority of existing methods.

¹ Additional summary details and the corresponding code links for these methods can be found at
https://github.com/NDS-VU/GNN-SSL-Chapter.


18.4 Node-level SSL Pretext Tasks 


For node-level pretext tasks, methods have been developed to use easily-accessible 
data to generate pseudo labels for each node or relationships for each pair of nodes. 
In this way, the GNNs are then trained to be predictive of the pseudo labels or to keep 
the equivalence between the node embeddings and the original node relationships. 


18.4.1 Structure-based Pretext Tasks 


Different nodes have different structure properties in graph topology, which can be 
measured by the node degree, centrality, node partition, etc. Thus, for structure- 
based pretext tasks at the node-level, we expect to align node embeddings extracted 
from the GNNs with their structure properties, in an attempt to ensure this informa- 
tion is preserved while GNNs learn the node embeddings. 

Since degree is the most fundamental topological property, Jin et al (2020d) design
the pretext task to recover the node degree from the node embeddings as follows:

$$\mathcal{L}_{ssl} = \frac{1}{|\mathcal{V}|} \sum_{v_i \in \mathcal{V}} \ell_{MSE}(z_{ssl,i}, d_i) \tag{18.14}$$

where $d_i$ represents the degree of node $i$ and $z_{ssl,i} = Z_{ssl}[i,:]^\top$ denotes the
self-supervised GNN embeddings of node $i$. It should be noted that this pretext task
can be generalized to harness any structural property at the node level.
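
The pseudo labels for this task come for free from the adjacency matrix. A minimal sketch of Eq. (18.14), assuming the adaptation layer outputs one scalar degree prediction per node (the function names and the star graph are hypothetical):

```python
def node_degrees(A):
    """Degree of every node of an unweighted, undirected graph."""
    return [sum(row) for row in A]

def degree_pretext_loss(predicted, A):
    """Mean squared error between predicted and true node degrees."""
    degrees = node_degrees(A)
    return sum((p - d) ** 2 for p, d in zip(predicted, degrees)) / len(degrees)

# Star graph: node 0 is connected to nodes 1, 2 and 3.
A = [[0, 1, 1, 1],
     [1, 0, 0, 0],
     [1, 0, 0, 0],
     [1, 0, 0, 0]]
degrees = node_degrees(A)                                    # [3, 1, 1, 1]
loss_exact = degree_pretext_loss([3.0, 1.0, 1.0, 1.0], A)    # 0.0
loss_off = degree_pretext_loss([2.0, 1.0, 1.0, 1.0], A)      # (3 - 2)^2 / 4 = 0.25
```

Replacing `node_degrees` with any other per-node structural statistic gives the generalization mentioned above.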

Node centrality measures the importance of nodes based on their structural roles
in the whole graph (Newman, 2018). Hu et al (2019c) design a pretext task to have
GNNs estimate the rank scores of node centrality. The specific centrality measures
considered are eigencentrality, betweenness, closeness, and subgraph centrality. For
a node pair $(u, v)$ and a centrality score $s$, with relative order
$R^s_{u,v} = \mathbb{1}(s_u > s_v)$
where $R^s_{u,v} = 1$ if $s_u > s_v$ and $R^s_{u,v} = 0$ if $s_u \le s_v$, a decoder
$D^{rank}_s$ for centrality score
$s$ estimates its rank score by $S_v = D^{rank}_s(z_{GNN,v})$. The probability of the estimated rank
order is defined by the sigmoid function
$\tilde{R}^s_{u,v} = \frac{\exp(S_u - S_v)}{1 + \exp(S_u - S_v)}$. Then predicting the
relative order between pairs of nodes could be formalized as a binary classification
problem with the loss:

$$\mathcal{L}_{ssl} = -\sum_{s} \sum_{u,v \in \mathcal{V}} \left( R^s_{u,v} \log \tilde{R}^s_{u,v} + (1 - R^s_{u,v}) \log(1 - \tilde{R}^s_{u,v}) \right) \tag{18.15}$$
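
For a single centrality measure, the pairwise ranking loss of Eq. (18.15) can be sketched as below. The sketch assumes the decoder has already produced one scalar rank score per node, and it skips the pairs with u = v, which only add a constant; both choices and the function name are assumptions made for illustration.

```python
import math

def rank_loss(scores, centrality):
    """Pairwise ranking loss for one centrality measure.

    scores     -- estimated rank scores S_v, one per node (decoder outputs)
    centrality -- ground-truth centrality values s_v, one per node
    """
    loss = 0.0
    n = len(scores)
    for u in range(n):
        for v in range(n):
            if u == v:  # self pairs only contribute a constant, so skip them
                continue
            r = 1.0 if centrality[u] > centrality[v] else 0.0
            r_hat = 1.0 / (1.0 + math.exp(-(scores[u] - scores[v])))  # sigmoid
            loss -= r * math.log(r_hat) + (1.0 - r) * math.log(1.0 - r_hat)
    return loss

# Hypothetical centrality values for three nodes.
centrality = [3.0, 1.0, 2.0]
loss_good = rank_loss([3.0, 1.0, 2.0], centrality)  # scores in the right order
loss_bad = rank_loss([1.0, 3.0, 2.0], centrality)   # scores in the wrong order
loss_flat = rank_loss([0.0, 0.0, 0.0], centrality)  # uninformative scores
```

Only the ordering of the centrality values enters the loss, so the decoder is free to produce scores on any scale as long as their differences rank the nodes correctly.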


Different from peer works, Hu et al (2019c) do not consider any node features but
instead extract the node features directly from the graph topology, which includes:
(1) degree that defines the local importance of a node; (2) core-number that defines 
the connectivity of the subgraph around a node; (3) collective influence that defines 
the neighborhood importance of a node; and (4) local clustering coefficient, which 
defines the connectivity of 1-hop neighborhood of a node. Then, the four features 
(after min-max normalization) are concatenated with a nonlinear transformation and 
fed into the GNN where (Hu et al, 2019c) uses the pretext tasks: centrality ranking, 
clustering recovery and edge prediction. Another innovative idea in (Hu et al, 2019c) 
is to choose a fix-tune boundary in the middle layer of GNNs. The GNN blocks 
below this boundary are fixed, while the ones above the boundary are fine-tuned. For 
downstream tasks that are closely related to the pre-trained tasks, a higher boundary 
is used. 

Another important node-level structural property is the partition each node belongs
to after performing a graph partitioning method. In (You et al, 2020c), the pretext
task is to train the GNNs to encode the node partition information. Graph partitioning
is to partition the nodes of a graph into different groups such that the number
of edges between groups is minimized. Given the node set $\mathcal{V}$, the edge set $\mathcal{E}$,
and a preset number of partitions $p \in [1, |\mathcal{V}|]$, a graph partitioning algorithm (e.g.,
(Karypis and Kumar, 1995) as used in (You et al, 2020c)) will output a set of node partitions
$\{\mathcal{V}_{par_1}, \dots, \mathcal{V}_{par_p} \mid \mathcal{V}_{par_i} \subset \mathcal{V}, i = 1, \dots, p\}$.
Then the classification loss is set exactly as in Eq. (18.10):

$$\mathcal{L}_{ssl} = \frac{1}{|\mathcal{V}|} \sum_{v_i \in \mathcal{V}} \ell_{CE}(z_{ssl,i}, y_{ssl,i}) \tag{18.16}$$

where $z_{ssl,i}$ denotes the embedding of node $v_i$ and assuming that the partitioning
label is a one-hot encoding $y_{ssl,i} \in \mathbb{R}^{p}$ with $k$-th entry as 1 and others as 0 if
$v_i \in \mathcal{V}_{par_k}$, $i = 1, \dots, |\mathcal{V}|$, $\exists k \in [1, p]$.
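
The sketch below illustrates the two ingredients of this pretext task: the quantity a partitioning algorithm tries to minimize (the number of edges between groups) and the conversion of a partition assignment into one-hot pseudo labels. The partition assignments are written by hand here; in practice they would come from a partitioning algorithm such as the one cited above. All names and the example graph are hypothetical.

```python
def edge_cut(A, assignment):
    """Number of undirected edges whose endpoints lie in different partitions."""
    n = len(A)
    return sum(1 for i in range(n) for j in range(i + 1, n)
               if A[i][j] == 1 and assignment[i] != assignment[j])

def one_hot_labels(assignment, p):
    """Pseudo labels y_ssl,i in R^p: entry k is 1 if node i is in partition k."""
    return [[1 if k == part else 0 for k in range(p)] for part in assignment]

# Two triangles (nodes 0-2 and 3-5) joined by the single edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
A = [[0] * 6 for _ in range(6)]
for i, j in edges:
    A[i][j] = A[j][i] = 1

good_assignment = [0, 0, 0, 1, 1, 1]  # splits the graph at the bridge
bad_assignment = [0, 1, 0, 1, 0, 1]   # cuts through both triangles

cut_good = edge_cut(A, good_assignment)  # only the bridge is cut: 1
cut_bad = edge_cut(A, bad_assignment)    # five edges are cut
Y_ssl = one_hot_labels(good_assignment, p=2)
```

The rows of `Y_ssl` are the targets used in the cross-entropy loss of Eq. (18.16).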


18.4.2 Feature-based Pretext Tasks 


Node features are another important source of information that can be leveraged to provide
extra supervision. Since the state-of-the-art GNNs suffer from over-smoothing (Chen
et al, 2020c), the original feature information is partially lost after being fed into the
GNNs. In order to reduce the information loss in node embeddings, the pretext task
in (Hu et al, 2020c; Jin et al, 2020d; Manessi and Rozza, 2020; Wang et al, 2017a;
You et al, 2020c) is to first mask node features and let the GNN predict those features.
More specifically, they randomly mask input node features by replacing them
with special mask indicators and then apply GNNs to obtain the corresponding node
embeddings. Finally a linear model is applied on top of embeddings to predict the
corresponding masked node features. Assuming the set of nodes that are masked is
$\mathcal{V}_m$, then the self-supervised regression loss to reconstruct these masked features is:

$$\mathcal{L}_{ssl} = \frac{1}{|\mathcal{V}_m|} \sum_{v_i \in \mathcal{V}_m} \ell_{MSE}(z_{ssl,i}, x_i) \tag{18.17}$$

To handle the high sparsity of the node features, it is beneficial to first perform
feature dimensionality reduction on $X$ (such as principal component analysis (PCA)
used in (Jin et al, 2020d)). Additionally, instead of reconstructing node features,
node embeddings could also be reconstructed from their corrupted version, such as
in (Manessi and Rozza, 2020).
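
The masking step and the loss of Eq. (18.17) can be sketched as follows. The sketch assumes that whole feature vectors of randomly chosen nodes are masked and that the mask indicator is a constant vector of zeros; the cited works may use other mask indicators. The function names and parameters are hypothetical.

```python
import random

def mask_features(X, mask_rate, rng, mask_value=0.0):
    """Replace the features of a random subset of nodes by a mask indicator.

    Returns the corrupted feature matrix and the sorted list of masked nodes.
    """
    n = len(X)
    num_masked = max(1, int(round(mask_rate * n)))
    masked_nodes = sorted(rng.sample(range(n), num_masked))
    X_masked = [list(row) for row in X]
    for i in masked_nodes:
        X_masked[i] = [mask_value] * len(X[i])
    return X_masked, masked_nodes

def masked_reconstruction_loss(X_pred, X, masked_nodes):
    """Mean squared reconstruction error over the masked nodes only."""
    total = 0.0
    for i in masked_nodes:
        total += sum((p - t) ** 2 for p, t in zip(X_pred[i], X[i]))
    return total / len(masked_nodes)

rng = random.Random(0)  # fixed seed so the example is reproducible
X = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
X_masked, masked_nodes = mask_features(X, mask_rate=0.5, rng=rng)
# X_masked is what the GNN sees; X provides the regression targets.
```

Only the masked nodes enter the loss, so the model cannot lower it by copying its input and must infer the missing features from the neighborhood.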

In contrast to graph partitioning, where nodes are grouped by the graph topology,
in graph clustering the clusters of nodes are discovered based on their features
(You et al, 2020c). In this way the pretext task can be designed to recover the
node clustering assignment. Given the node set $\mathcal{V}$, the feature matrix $X$, and a preset
number of clusters $p \in [1, |\mathcal{V}|]$ (or without it if the clustering algorithm automatically
learns the number of clusters) as input, the clustering algorithm will output a set of
node clusters
$\{\mathcal{V}_{clu_1}, \dots, \mathcal{V}_{clu_p} \mid \mathcal{V}_{clu_i} \subset \mathcal{V}, i = 1, \dots, p\}$,
and we assume that for node $v_i$ the
cluster label is a one-hot encoding $y_{ssl,i} \in \mathbb{R}^{p}$ with $k$-th entry as 1 and others
as 0 if $v_i \in \mathcal{V}_{clu_k}$, $i = 1, \dots, |\mathcal{V}|$, $\exists k \in [1, p]$. Then the loss is
the same as Eq. (18.16).

Instead of focusing on individual nodes, pretext tasks have also been developed
based on the relationship between pairs of nodes (Jin et al, 2021, 2020d). The basic
idea is to retain the node pairwise feature similarity in the node embeddings from
GNNs. Suppose $\mathcal{T}_s, \mathcal{T}_d$ denote the sets of node pairs having the highest and the
lowest similarity:

$$\mathcal{T}_s = \{(v_i, v_j) \mid \mathrm{sim}(x_i, x_j) \text{ in top-}B \text{ of } \{\mathrm{sim}(x_i, x_b)\}_{b=1}^{|\mathcal{V}|} \setminus \mathrm{sim}(x_i, x_i), \forall v_i \in \mathcal{V}\}, \tag{18.18}$$

$$\mathcal{T}_d = \{(v_i, v_j) \mid \mathrm{sim}(x_i, x_j) \text{ in bottom-}B \text{ of } \{\mathrm{sim}(x_i, x_b)\}_{b=1}^{|\mathcal{V}|} \setminus \mathrm{sim}(x_i, x_i), \forall v_i \in \mathcal{V}\}, \tag{18.19}$$

where $\mathrm{sim}(x_i, x_j)$ measures the cosine similarity of features between two nodes $v_i, v_j$
and $B$ is the number of top/bottom pairs selected for each node. Then the pretext task
is to optimize the following regression loss:

$$\mathcal{L}_{ssl} = \frac{1}{|\mathcal{T}_s \cup \mathcal{T}_d|} \sum_{(v_i, v_j) \in \mathcal{T}_s \cup \mathcal{T}_d} \ell_{MSE}\big(f_w(|z_{GNN,i} - z_{GNN,j}|), \mathrm{sim}(x_i, x_j)\big), \tag{18.20}$$

where $f_w$ is a function mapping the difference between two node embeddings from
GNNs to a scalar representing the similarity between them.
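
A minimal sketch of Eqs. (18.18)-(18.20) follows. The pair selection matches the definitions above; the mapping `f_w`, however, is learned in the original method, and the fixed function used here is only a hypothetical stand-in so that the loss can be evaluated. The example features are also hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def similarity_pairs(X, B):
    """Return (T_s, T_d): per node, its B most and B least similar other nodes."""
    n = len(X)
    T_s, T_d = [], []
    for i in range(n):
        others = sorted((j for j in range(n) if j != i),
                        key=lambda j: cosine(X[i], X[j]), reverse=True)
        T_s += [(i, j) for j in others[:B]]
        T_d += [(i, j) for j in others[-B:]]
    return T_s, T_d

def pair_similarity_loss(Z, X, pairs, f_w):
    """Mean squared error between f_w(|z_i - z_j|) and sim(x_i, x_j)."""
    total = 0.0
    for i, j in pairs:
        diff = [abs(a - b) for a, b in zip(Z[i], Z[j])]
        total += (f_w(diff) - cosine(X[i], X[j])) ** 2
    return total / len(pairs)

def f_w(diff):
    """Hypothetical stand-in for the learned mapping: one minus the mean gap."""
    return 1.0 - sum(diff) / len(diff)

X = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
T_s, T_d = similarity_pairs(X, B=1)
# T_s pairs each node with its most similar node, T_d with its least similar one.
loss = pair_similarity_loss(X, X, T_s + T_d, f_w)  # embeddings Z set equal to X
```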


18.4.3 Hybrid Pretext Tasks


Instead of employing only the topology or only the feature information as the extra 
supervision, some pretext tasks combine them together as the hybrid supervision, or 
even utilize information from the known training labels. 

A contrastive framework for unsupervised graph representation learning, GRACE, 
where two correlated graph views are generated by randomly performing corrup- 
tion on attributes (masking node features) and topology (removing or adding graph 
edges) is proposed in (Zhu et al, 2020c). Then the GNNs are trained using a con- 
trastive loss to maximize the agreement between node embeddings in these two 
views. In each iteration two graph views Y! = {A!,X!} and Y = {A?,X*} are 
generated randomly according to the possible augmentation functions from an input 
graph Y = {A,X}. 

The objective is to maximize the similarity of the same nodes in different views of
the graph while minimizing the similarity of different nodes in the same or different
views of the graph. Thus, if we denote the node embeddings in the two views as
$Z^1_{GNN} = f_\theta(X^1, A^1)$, $Z^2_{GNN} = f_\theta(X^2, A^2)$, then the contrastive NT-Xent loss is:


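$$\mathcal{L}_{ssl} = \frac{1}{|\mathcal{P}^+|} \sum_{(v_i^1, v_i^2) \in \mathcal{P}^+} \ell_{NT\text{-}Xent}(Z^1_{GNN}, Z^2_{GNN}, \mathcal{P}^-), \tag{18.21}$$

where $\mathcal{P}^+$ includes positive pairs $(v_i^1, v_i^2)$ where $v_i^1, v_i^2$ correspond to the same
node in different views, while $\mathcal{P}^- = \bigcup_{(v_i^1, v_i^2) \in \mathcal{P}^+} \mathcal{P}^-_i$ represents all sets of negative
samples, with $\mathcal{P}^-_i$ containing nodes different from $v_i$ in the same view (intra-view
negative pairs) or the other view (inter-view negative pairs).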
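
As a rough illustration of such a loss, the sketch below implements one commonly used form of NT-Xent with both intra-view and inter-view negatives. The cosine similarity, the temperature `tau`, and the symmetrization over the two views are assumptions of this sketch rather than details stated above, and all embeddings are assumed to be non-zero.

```python
import math

def cosine(x, y):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))

def nt_xent(Z1, Z2, tau=0.5):
    """Average NT-Xent loss; row i of Z1 and row i of Z2 form the positive pair."""
    n = len(Z1)

    def one_direction(U, V):
        total = 0.0
        for i in range(n):
            pos = math.exp(cosine(U[i], V[i]) / tau)
            # Inter-view negatives: other nodes in the other view.
            inter = sum(math.exp(cosine(U[i], V[k]) / tau) for k in range(n) if k != i)
            # Intra-view negatives: other nodes in the same view.
            intra = sum(math.exp(cosine(U[i], U[k]) / tau) for k in range(n) if k != i)
            total += -math.log(pos / (pos + inter + intra))
        return total / n

    # Symmetrize so that each view serves as the anchor once.
    return 0.5 * (one_direction(Z1, Z2) + one_direction(Z2, Z1))

aligned = nt_xent([[1.0, 0.0], [0.0, 1.0]], [[0.9, 0.1], [0.1, 0.9]])
shuffled = nt_xent([[1.0, 0.0], [0.0, 1.0]], [[0.1, 0.9], [0.9, 0.1]])
```

The loss is small when each node's two views agree (`aligned`) and large when a node is closer to other nodes than to its own counterpart (`shuffled`).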

More specifically, in the above, the two graph corruptions are removing edges
and masking node features. In removing edges, a random masking matrix
$M \in \{0, 1\}^{|\mathcal{V}| \times |\mathcal{V}|}$ is sampled whose entries are drawn from a Bernoulli
distribution $M_{ij} \sim \mathcal{B}(1 - p_r)$ if $A_{ij} = 1$ for the original graph, where $p_r$ is the
probability of each edge being removed. The resulting matrix can be computed as
$A^1 = A \odot M$, creating the adjacency matrix of graph view $\mathcal{G}^1$ from $\mathcal{G}$.

In masking node features, a random vector $m \in \{0, 1\}^d$ is utilized, where each
dimension of $m$ is independently drawn from a Bernoulli distribution with probability
$1 - p_m$ and $d$ is the dimension of the node features $X$. Then, the generated node
features $X^1$ for graph view $\mathcal{G}^1$ from $\mathcal{G}$ are computed by:


$$X^1 = [x_1 \odot m; x_2 \odot m; \cdots; x_{|\mathcal{V}|} \odot m], \tag{18.22}$$


where $[\cdot;\cdot]$ is the concatenation operator. Moreover, a modified version of GRACE
is proposed in (Zhu et al, 2021), where the whole contrastive procedure is the same as
GRACE except that the graph augmentation is adaptively performed based on the
importance of nodes and edges. Specifically, the probability of removing an edge
between nodes $v_i, v_j$ should reflect the importance of the edge $(v_i, v_j)$ such that the
augmentation function is more likely to corrupt unimportant edges while keeping
important connective structures intact in augmented views. Similarly, the feature
dimensions frequently appearing in influential nodes are seen as important and so
are masked with lower probability.
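
The two uniform corruptions (edge removal and feature masking) can be sketched as follows. The function names are hypothetical, and one mask entry is drawn per undirected edge so that the corrupted adjacency matrix stays symmetric, which is an assumption of this sketch. The adaptive variant would replace the constant probabilities by importance-dependent ones.

```python
import random

def drop_edges(A, p_r, rng):
    """Return A (Hadamard) M, keeping each existing undirected edge with probability 1 - p_r."""
    n = len(A)
    A1 = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if A[i][j] == 1 and rng.random() >= p_r:
                A1[i][j] = A1[j][i] = 1
    return A1

def mask_features(X, p_m, rng):
    """Zero out each feature dimension with probability p_m, using one mask for all nodes."""
    d = len(X[0])
    m = [0 if rng.random() < p_m else 1 for _ in range(d)]
    return [[x * keep for x, keep in zip(row, m)] for row in X], m

rng = random.Random(0)
A = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
X = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
A_view = drop_edges(A, p_r=0.3, rng=rng)
X_view, mask = mask_features(X, p_m=0.3, rng=rng)
```

Calling both functions twice with different random draws yields the two views that are then fed to the shared encoder.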

Chen et al (2020b) observe that nodes at a greater topological distance from the
labeled nodes are more likely to be misclassified, which indicates that the ability of
GNNs to embed node features is unevenly distributed over the graph. However, existing
graph contrastive learning methods ignore this uneven distribution, which motivates
Chen et al (2020b) to propose the distance-wise graph contrastive learning (DwGCL)
method that can adaptively augment the graph topology, sample the positive and
negative pairs, and maximize the mutual information. The topology information gain
(TIG) is calculated based on Group PageRank and node features to describe the task
information effectiveness that the node obtains from labeled nodes along the graph
topology. By ranking the performance of GNNs on nodes according to their TIG values
with/without contrastive learning, it is found that contrastive learning mainly
improves the performance on nodes that are topologically far away from the labeled
nodes. Based on this finding, Chen et al (2020b) propose to: 1) perturb the graph
topology by augmenting nodes according to their TIG values; 2) sample the positive and
negative pairs considering local/global topology distance and node embedding distance;
and 3) assign different weights to nodes in the self-supervised loss based on their TIG
rankings. Results demonstrate the performance improvement of this distance-wise graph
contrastive learning over the typical contrastive learning approach.

Another special source of supervision to exploit is the prediction results of
the model itself. Sun et al (2020c) leverage a multi-stage training framework
to utilize the pseudo labels generated by predictions in the next rounds of training.
The multi-stage training algorithm repeatedly adds the most confident predictions of
each class to the label set and re-utilizes these pseudo-labeled data to train the
GNNs. Furthermore, a self-checking mechanism based on DeepCluster (Caron et al, 2018)
is proposed to guarantee the precision of labeled data. Assuming that the cluster
assignment for node $v_i$ is $c_i \in \{0, 1\}^p$ (here the number of clusters is assumed to
equal the number of predefined classes $p$ in the downstream classification task) and
the centroid matrix $C \in \mathbb{R}^{d' \times p}$ represents the feature of each cluster, then we
obtain the cluster assignment $c_i$ for each node $v_i$ by optimizing:


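$$\min_{C} \frac{1}{|\mathcal{V}|} \sum_{v_i \in \mathcal{V}} \min_{c_i \in \{0,1\}^p} \|z_{GNN,i} - C c_i\|_2^2, \quad \text{s.t. } c_i^\top \mathbf{1}_p = 1. \tag{18.23}$$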


After applying DeepCluster to group nodes into multiple clusters, an aligning
mechanism is used to assign nodes in each cluster to their corresponding class
defined by downstream tasks. For each cluster $k \in \{1, \ldots, p\}$ in unlabeled data,
the aligning mechanism computes:


$$c^{k} = \arg\min_{m} \|\kappa_k - \mu_m\|^2, \tag{18.24}$$

where $\mu_m$ denotes the centroid of class $m$ in labeled data, $\kappa_k$ denotes the centroid
of cluster $k$ in unlabeled data and $c^{k}$ represents the aligned class that has the
closest distance to the centroid $\kappa_k$ of the cluster $k$ among all centroids of classes in
the original labeled data. Note that the self-checking can be directly performed by
comparing the distance of each unlabeled node to centroids of classes in labeled data.
However, directly checking in this naive way is very time-consuming.
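
The aligning step of Eq. (18.24) amounts to a nearest-centroid match between clusters and classes, as in the sketch below. The `self_check` rule shown here, which keeps a pseudo label only when the classifier prediction agrees with the aligned class of the node's cluster, is an assumption made for illustration, and all function names are hypothetical.

```python
def centroid(rows):
    """Coordinate-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def align_clusters(Z_unlabeled, cluster_ids, Z_labeled, labels):
    """Map each cluster id to the class whose labeled centroid is nearest."""
    class_centroids = {m: centroid([z for z, y in zip(Z_labeled, labels) if y == m])
                       for m in set(labels)}
    mapping = {}
    for k in set(cluster_ids):
        kappa_k = centroid([z for z, c in zip(Z_unlabeled, cluster_ids) if c == k])
        mapping[k] = min(class_centroids,
                         key=lambda m: sq_dist(kappa_k, class_centroids[m]))
    return mapping

def self_check(pred_classes, cluster_ids, mapping):
    """Indices of unlabeled nodes whose prediction matches their aligned cluster class."""
    return [i for i, (p, c) in enumerate(zip(pred_classes, cluster_ids))
            if mapping[c] == p]

Z_labeled = [[0.0, 0.0], [0.2, 0.0], [5.0, 5.0], [5.0, 5.2]]
labels = [0, 0, 1, 1]
Z_unlabeled = [[0.1, 0.1], [4.9, 5.1], [0.3, -0.1]]
cluster_ids = [1, 0, 1]  # cluster ids are arbitrary and need not match class ids
mapping = align_clusters(Z_unlabeled, cluster_ids, Z_labeled, labels)
kept = self_check([0, 1, 1], cluster_ids, mapping)
```

Only one distance computation per cluster-class pair is needed, instead of one per unlabeled node, which is why the aligned check is cheaper than the naive one.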


18.5 Graph-level SSL Pretext Tasks 


Having presented the node-level SSL pretext tasks, in this section we focus
on the graph-level SSL pretext tasks, where we desire the node embeddings coming
from the GNNs to encode information of graph-level properties.


18.5.1 Structure-based Pretext Tasks 


As the counterpart of the nodes in the graph, the edges encode abundant information
of the graph, which can also be leveraged as an extra supervision to design pretext
tasks. The pretext task in (Zhu et al, 2020a) is to recover the graph topology, i.e.,
predict edges, after randomly removing edges in the graph. After the node embedding
$z_{GNN,i}$ is obtained for each node $v_i$, the probability of the edge between any pair of
nodes $v_i, v_j$ is calculated by their feature similarity as follows:


$$\hat{A}_{ij} = \text{sigmoid}(z_{GNN,i} (z_{GNN,j})^\top), \tag{18.25}$$

and the weighted cross-entropy loss is used during training, which is defined as:

$$\mathcal{L}_{ssl} = -\sum_{v_i, v_j \in \mathcal{V}} W (A_{ij} \log \hat{A}_{ij}) + (1 - A_{ij}) \log(1 - \hat{A}_{ij}), \tag{18.26}$$

where $W$ is the weight hyperparameter used for balancing the two classes, namely
node pairs having an edge and node pairs without an edge between them.
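
A direct, unoptimized reading of Eqs. (18.25) and (18.26) is sketched below. The clamping constant is an implementation detail added here to avoid `log(0)`, and the toy embeddings are illustrative only.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def edge_reconstruction_loss(Z, A, W):
    """Weighted cross-entropy between A and sigmoid(z_i . z_j) over all node pairs."""
    n = len(Z)
    loss = 0.0
    for i in range(n):
        for j in range(n):
            a_hat = sigmoid(sum(p * q for p, q in zip(Z[i], Z[j])))
            a_hat = min(max(a_hat, 1e-12), 1.0 - 1e-12)  # numerical safety only
            loss -= W * A[i][j] * math.log(a_hat) + (1 - A[i][j]) * math.log(1.0 - a_hat)
    return loss

# Nodes 0 and 1 have similar embeddings; node 2 points the opposite way.
Z = [[2.0, 0.0], [2.0, 0.0], [-2.0, 0.0]]
A_consistent = [[0, 1, 0], [1, 0, 0], [0, 0, 0]]    # edge between similar nodes
A_inconsistent = [[0, 0, 1], [0, 0, 0], [1, 0, 0]]  # edge between dissimilar nodes
loss_good = edge_reconstruction_loss(Z, A_consistent, W=2.0)
loss_bad = edge_reconstruction_loss(Z, A_inconsistent, W=2.0)
```

Because real graphs are sparse, `W` is typically chosen larger than 1 so that the few positive pairs are not swamped by the many negative ones.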

It is known that an unclean graph structure usually impedes the applicability of
GNNs (Cosmo et al, 2020; Jang et al, 2019). A method that trains the GNNs on
downstream supervised tasks using a cleaned graph structure, reconstructed by
completing a self-supervised pretext task, is introduced in (Fatemi et al, 2021).
The self-supervised pretext task aims to train a separate GNN to denoise the
corrupted node features $\tilde{X}$, generated either by randomly zeroing some dimensions of
the original node features $X$ when the features are binary or by adding independent
Gaussian noise when $X$ is continuous. Two methods are used to generate the initial
graph adjacency matrix $\tilde{A}$. The first method, Full Parametrization (FP), treats every
entry in $\tilde{A}$ as a parameter and directly optimizes its $|\mathcal{V}|^2$ parameters by denoising
the corrupted features $\tilde{X}$. The second method, MLP-kNN, considers a mapping function
kNN(MLP($X$)), where a multilayer perceptron (i.e., MLP($\cdot$)) updates the original
node features and kNN($\cdot$) produces a sparse matrix by selecting the top-$k$ most similar
nodes to each node and adding edges between them. Then, the generated initial adjacency
matrix $\tilde{A}$ is normalized and symmetrized into a new adjacency matrix $A$ as follows:


$$A = D^{-\frac{1}{2}} \, \frac{P(\tilde{A}) + P(\tilde{A})^\top}{2} \, D^{-\frac{1}{2}}, \tag{18.27}$$

where $P$ is a function with a non-negative range to ensure the non-negativity of every
entry in $A$. In the MLP-kNN method, $P$ is the element-wise ReLU function. However,
the ReLU function could cause a gradient flow problem in the FP method, so
the element-wise ELU function followed by an addition of 1 is used there instead.
Next, a separate GNN-based encoder takes the noisy node features $\tilde{X}$ and the new
normalized adjacency matrix $A$ as input and outputs the updated node features
$\tilde{Z} = \text{GNN}(\tilde{X}, A)$. The parameters in FP and MLP-kNN used for generating the
initial adjacency matrix $\tilde{A}$ are optimized by:


$$\mathcal{L}_{ssl} = \frac{1}{|\mathcal{V}|} \sum_{v_i \in \mathcal{V}} \ell_{MSE}(x_i, \tilde{z}_i), \tag{18.28}$$


where $\tilde{z}_i = \tilde{Z}[i,:]^\top$ is the noisy embedding vector of the node $v_i$ obtained by the
separate GNN-based encoder. The optimized parameters in FP and MLP-kNN lead
to the generation of a cleaner graph adjacency matrix, which in turn results in
better performance in the downstream tasks.
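
The symmetrization and normalization of Eq. (18.27) can be sketched as below for both choices of $P$. Taking $D$ to be the degree matrix of the symmetrized matrix is an assumption of this sketch, as are the function names.

```python
import math

def relu(x):
    return max(x, 0.0)

def elu_plus_one(x):
    """ELU(x) + 1 with unit slope parameter: strictly positive for every input."""
    return x + 1.0 if x > 0 else math.exp(x)

def normalize_adjacency(A_tilde, P):
    """Apply P entry-wise, symmetrize, then scale by D^{-1/2} on both sides."""
    n = len(A_tilde)
    S = [[(P(A_tilde[i][j]) + P(A_tilde[j][i])) / 2.0 for j in range(n)]
         for i in range(n)]
    deg = [sum(row) for row in S]  # degrees of the symmetrized matrix (assumption)
    return [[S[i][j] / math.sqrt(deg[i] * deg[j]) if deg[i] > 0 and deg[j] > 0 else 0.0
             for j in range(n)] for i in range(n)]

A_tilde = [[0.0, 2.0, -1.0], [0.0, 0.0, 3.0], [1.0, 0.0, 0.0]]
A_relu = normalize_adjacency(A_tilde, relu)
A_elu = normalize_adjacency(A_tilde, elu_plus_one)
```

With ReLU, negative entries are cut to zero and receive no gradient, whereas ELU plus one keeps every entry positive, which matches the motivation given above for the FP variant.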

In addition to the graph edges and the adjacency matrix, the topological distance
between nodes is another important global structural property of a graph. The pretext
task in (Peng et al, 2020) is to recover the topological distance between nodes. More
specifically, they leverage the shortest path between nodes $v_i$ and $v_j$, denoted as
$p_{ij}$, but this could be replaced with any other distance measure.
Then, they define the set $\mathcal{C}_i^k$ as all the nodes having a shortest path distance of
length $k$ from node $v_i$. More formally, this is defined as:


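$$\mathcal{C}_i = \mathcal{C}_i^1 \cup \mathcal{C}_i^2 \cup \cdots \cup \mathcal{C}_i^{\delta_i}, \quad \mathcal{C}_i^k = \{v_j \mid d_{ij} = k\}, \quad k = 1, 2, \cdots, \delta_i, \tag{18.29}$$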


where $\delta_i$ is the upper bound of the hop count from other nodes to $v_i$, $d_{ij}$ is the length
of the path $p_{ij}$, and $\mathcal{C}_i$ is the union of all the $k$-hop shortest path neighbor sets $\mathcal{C}_i^k$.
Based on these sets, one-hot encodings $\mathbf{d}_{ij} \in \mathbb{R}^{\delta_i}$ are created for pairs of nodes $v_i, v_j$,
where $v_j \in \mathcal{C}_i$, according to their distance $d_{ij}$. Then, the GNN model is guided to
extract node embeddings that encode node topological distance as follows:


$$\mathcal{L}_{ssl} = \sum_{v_i \in \mathcal{V}} \sum_{v_j \in \mathcal{C}_i} \ell_{CE}(f_w(|z_{GNN,i} - z_{GNN,j}|), \mathbf{d}_{ij}), \tag{18.30}$$


where $f_w$ is a function mapping the difference between two node embeddings to
the probabilities of pairs of nodes belonging to the corresponding category of the
topological distance. Since the number of categories depends on the upper bound
of the hop count (topological distance), and precisely determining this upper bound
is time-consuming for a big graph, it is assumed that the number of hops (distance)
is under control based on the small-world phenomenon (Newman, 2018), and the distances
are further divided into several major categories that clearly discriminate the
dissimilarity and partly tolerate the similarity. Experiments demonstrate that dividing
the topological distance into four categories, $\mathcal{C}_i^1, \mathcal{C}_i^2, \mathcal{C}_i^3, \mathcal{C}_i^k\ (k \geq 4)$, achieves the
best performance (i.e., $\delta_i = 4$). Another problem is that the number of nodes that are
close to the focal node $v_i$ is much smaller than the number of nodes that are further
away (i.e., the magnitude of $\mathcal{C}_i^{\delta_i}$ will be significantly larger than that of the other
sets). To circumvent this imbalance problem, node pairs are sampled with an adaptive ratio.
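
The distance categories can be computed with a breadth-first search, as in the following sketch. The adjacency-list representation and the function names are illustrative assumptions; all nodes at `delta` or more hops share the last category, following the four-category setup described above.

```python
from collections import deque

def hop_distances(adj, source):
    """Shortest-path hop counts from `source` to every reachable node (BFS)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def distance_classes(adj, source, delta=4):
    """Class index k-1 for nodes at hop k < delta; hops >= delta share class delta-1."""
    return {v: min(d, delta) - 1
            for v, d in hop_distances(adj, source).items() if v != source}

# Path graph 0 - 1 - 2 - 3 - 4 - 5.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
classes = distance_classes(adj, source=0)
```

The resulting class index of each pair is the target that the one-hot encoding in Eq. (18.30) represents.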

Network motifs are recurrent and statistically significant subgraphs of a larger
graph, and Zhang et al (2020f) design a pretext task to train a GNN encoder that can
automatically extract graph motifs. The learned motifs are further leveraged to
generate informative subgraphs used in graph-subgraph contrastive learning. Firstly, a
GNN-based encoder $f_\theta$ and an $m$-slot embedding table $\{m_1, \ldots, m_m\}$ denoting $m$
cluster centers of $m$ motifs are initialized. Then, a node affinity matrix
$U \in \mathbb{R}^{|\mathcal{V}| \times |\mathcal{V}|}$ is calculated by softmax normalization on the embedding similarity
$\mathcal{D}(z_{GNN,i}, z_{GNN,j})$ between nodes $i, j$ as in Eq. (18.13). Afterwards, spectral
clustering (Von Luxburg, 2007) is performed on $U$ to generate different groups, within
which $n_g$ connected components that have more than three nodes are collected as the
sampled subgraphs from the graph $\mathcal{G}$, and their embeddings are calculated by applying
a READOUT function. For each subgraph, its cosine similarity to each of the $m$
motifs is calculated to obtain a similarity matrix $S \in \mathbb{R}^{m \times n_g}$. To produce
semantically meaningful subgraphs that are close to motifs, the top 10% most similar
subgraphs to each motif are selected based on the similarity matrix $S$ and are
collected into a set $\mathcal{G}^{top}$. The affinity values in $U$ between pairs of nodes in each of
these subgraphs are increased by optimizing the loss:

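$$\mathcal{L}_1 = -\frac{1}{|\mathcal{G}^{top}|} \sum_{i=1}^{|\mathcal{G}^{top}|} \sum_{(v_j, v_k) \in \mathcal{G}_i^{top}} U[j, k]. \tag{18.31}$$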


The optimization of the above loss forces nodes in motif-like subgraphs to be more
likely to be grouped together in spectral clustering, which leads to more subgraph
samples aligned with the motifs. Next, the embedding table of motifs is optimized
based on the sampled subgraphs. The assignment matrix $Q \in \mathbb{R}^{m \times n_g}$ is found by
maximizing similarities between embeddings and their assigned motifs:


$$\max_{Q} \ \text{Tr}(Q^\top S) - \frac{1}{\lambda} \sum_{i,j} Q[i, j] \log Q[i, j], \tag{18.32}$$


where the second term, controlled by the hyperparameter $\lambda$, is to avoid all
representations collapsing into a single cluster center. After the cluster assignment
matrix $Q$ is obtained, the GNN-based encoder and the motif embedding table are trained,
which is equivalent to a supervised $m$-class classification problem with labels $Q$ and
the prediction distribution $\tilde{S}$ obtained by applying a column-wise softmax
normalization with temperature $\tau$:


$$\mathcal{L}_2 = -\frac{1}{n_g} \sum_{i=1}^{n_g} \ell_{CE}(q_i, \tilde{s}_i), \tag{18.33}$$


where $q_i = Q[:, i]$ and $\tilde{s}_i = \tilde{S}[:, i]$ denote the assignment distribution and predicted
distribution for the subgraph $i$, respectively. Optimizing Eq. (18.33) jointly
enhances the ability of the GNN encoder to extract subgraphs that are similar to
motifs and improves the embeddings of motifs. The last step is to train the GNN-based
encoder by a classification task where subgraphs are reassigned back to their
corresponding graphs. Note that the subgraphs are generated by the motif-guided
extractor, which is more likely to capture higher-level semantic information
compared with randomly sampled subgraphs. The whole framework is trained jointly
by a weighted combination of $\mathcal{L}_1$, $\mathcal{L}_2$ and the contrastive loss.

Aside from the network motifs, other subgraph structures can be leveraged to
provide extra supervision in designing pretext tasks. In (Qiu et al, 2020a), an $r$-ego
network for a certain vertex is defined as the subgraph induced by nodes whose
shortest path to that vertex has length shorter than $r$. Then a random walk with
restart is initiated at the ego vertex $v_i$, and the subgraph induced by the nodes
visited during the random walk starting at $v_i$ is used as an augmented version of the
$r$-ego network. First, two augmented $r$-ego networks centered around vertex $v_i$ are
obtained by performing the random walk twice (i.e., $\mathcal{G}_i^1$ and $\mathcal{G}_i^2$), which are defined as
a positive pair since they come from the same $r$-ego network. In comparison, a negative
pair corresponds to two subgraphs augmented from different $r$-ego networks (e.g., one
coming from $v_i$ and another coming from $v_j$, resulting in random walk induced
subgraphs $\mathcal{G}_i$ and $\mathcal{G}_j$, respectively). Based on the above defined positive and
negative subgraph pairs, a contrastive loss is set up to optimize the GNNs as follows:


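$$\mathcal{L}_{ssl} = \frac{1}{|\mathcal{P}^+|} \sum_{(\mathcal{G}_i^1, \mathcal{G}_i^2) \in \mathcal{P}^+} \ell_{NT\text{-}Xent}(Z^1, Z^2, \mathcal{P}^-), \tag{18.34}$$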


where $Z^1$ and $Z^2$ denote the GNN-based graph embeddings and specifically here the
two different views are the same, $Z^1 = Z^2 = Z_{GNN}$. $\mathcal{P}^+$ contains positive pairs of
subgraphs $(\mathcal{G}_i^1, \mathcal{G}_i^2)$ sampled by random walk starting at the same ego vertex $v_i$ in the
same graph, while $\mathcal{P}^- = \bigcup_{(\mathcal{G}_i^1, \mathcal{G}_i^2) \in \mathcal{P}^+} \mathcal{P}^-_{\mathcal{G}_i}$ represents all sets of negative samples.
Specifically, $\mathcal{P}^-_{\mathcal{G}_i}$ represents subgraphs sampled by random walk starting either at a
different ego vertex from $v_i$ in $\mathcal{G}$ or directly in graphs different from $\mathcal{G}$.
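
A random walk with restart that returns the induced subgraph can be sketched as follows. The step count, restart probability, and function name are illustrative assumptions, and the restriction of the walk to the $r$-ego network is omitted for brevity.

```python
import random

def rwr_subgraph(adj, ego, steps, restart_prob, rng):
    """Random walk with restart from `ego`; returns the induced subgraph (nodes, edges)."""
    visited = {ego}
    current = ego
    for _ in range(steps):
        if rng.random() < restart_prob or not adj[current]:
            current = ego  # jump back to the ego vertex
        else:
            current = rng.choice(sorted(adj[current]))
        visited.add(current)
    nodes = sorted(visited)
    # Keep every original edge whose two endpoints were both visited.
    edges = [(u, v) for u in nodes for v in adj[u] if v in visited and u < v]
    return nodes, edges

adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4], 4: [3]}
rng = random.Random(0)
view_1 = rwr_subgraph(adj, ego=0, steps=10, restart_prob=0.5, rng=rng)
view_2 = rwr_subgraph(adj, ego=0, steps=10, restart_prob=0.5, rng=rng)  # positive pair
other = rwr_subgraph(adj, ego=4, steps=10, restart_prob=0.5, rng=rng)   # negative sample
```

Two walks from the same ego vertex give a positive pair, while a walk from another ego vertex (or another graph) supplies a negative sample.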

Although the Graph Attention Network (GAT) (Petar et al, 2018) achieves performance
improvements over the original GCN (Kipf and Welling, 2017b), there is
little understanding of what graph attention learns. To this end, Kim and Oh (2021)
propose a specific pretext task that leverages the edge information to supervise what
graph attention learns:

$$\mathcal{L}_{ssl} = -\frac{1}{|\mathcal{E} \cup \mathcal{E}^-|} \sum_{(j,i) \in \mathcal{E} \cup \mathcal{E}^-} \mathbb{1}((j,i) \in \mathcal{E}) \cdot \log \chi_{ij} + \mathbb{1}((j,i) \in \mathcal{E}^-) \log(1 - \chi_{ij}), \tag{18.35}$$

where $\mathcal{E}$ is the set of edges, $\mathcal{E}^-$ is the sampled set of node pairs without edges,
and $\chi_{ij}$ is the edge probability between nodes $i, j$ calculated from their embeddings.
Based on two primary edge attentions, the GAT attention (GO for short) (Petar
et al, 2018) and the dot-product attention (DP for short) (Luong et al, 2015), two
advanced attention mechanisms, SuperGAT$_{SD}$ (Scaled Dot-product, SD for short)
and SuperGAT$_{MX}$ (Mixed GO and DP, MX for short), are proposed:


$$e_{ij,SD} = e_{ij,DP} / \sqrt{F}, \quad \chi_{ij,SD} = \sigma(e_{ij,SD}), \tag{18.36}$$

$$e_{ij,MX} = e_{ij,GO} \cdot \sigma(e_{ij,DP}), \quad \chi_{ij,MX} = \sigma(e_{ij,DP}), \tag{18.37}$$


where $\sigma$ denotes the sigmoid function taking the edge weight $e_{ij}$ and calculating the
edge probability $\chi_{ij}$. SuperGAT$_{SD}$ divides the dot-product edge weight $e_{ij,DP}$ by the
square root of the dimension, as in the Transformer (Vaswani et al, 2017), to prevent
some large values from dominating the entire attention after softmax. SuperGAT$_{MX}$
multiplies GO and DP attention with sigmoid, which is motivated by the gating
mechanism of Gated Recurrent Units (GRUs) (Cho et al, 2014a). Since DP attention with
the sigmoid denotes the edge probability, multiplying by $\sigma(e_{ij,DP})$ in calculating
$e_{ij,MX}$ can softly drop neighbors that are not likely linked while implicitly assigning
importance to the remaining nodes. $e_{ij,DP}$ and $e_{ij,GO}$ are the weights of edge $(i, j)$
used to calculate the DP and GO attention, respectively. Results disclose several
insightful findings, including that GO attention learns label agreement better than
DP, whereas DP predicts edge presence better than GO, and that the performance of the
attention mechanism is not fixed but depends on the homophily and average degree of
the specific graph.
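
Given precomputed edge weights, Eqs. (18.35) to (18.37) reduce to a few scalar operations, as the sketch below shows. How $e_{ij,GO}$ and $e_{ij,DP}$ are obtained from the node embeddings is outside this sketch; they are simply passed in as numbers, and the function names are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def supergat_sd(e_dp, F):
    """Scaled dot-product: returns (e_SD, chi_SD) for feature dimension F."""
    e_sd = e_dp / math.sqrt(F)
    return e_sd, sigmoid(e_sd)

def supergat_mx(e_go, e_dp):
    """Mixed GO and DP: the GO weight is gated by the DP edge probability."""
    chi = sigmoid(e_dp)
    return e_go * chi, chi

def edge_loss(chis, is_edge):
    """Binary cross-entropy of Eq. (18.35) over true edges and sampled non-edges."""
    total = sum(math.log(c) if y else math.log(1.0 - c) for c, y in zip(chis, is_edge))
    return -total / len(chis)

e_sd, chi_sd = supergat_sd(e_dp=4.0, F=16)
e_mx, chi_mx = supergat_mx(e_go=2.0, e_dp=0.0)
loss = edge_loss([0.5, 0.5], [True, False])
```

In the MX variant a pair with a very negative dot product gets a gate close to zero, which is the "soft dropping" of unlikely neighbors described above.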

The topological information can also be generated manually for designing pretext
tasks. Gao et al (2021) propose to encode the transformation information between
two different graph topologies in the representations of nodes obtained by GNNs.
First, they transform the original graph adjacency matrix $A$ into $\tilde{A}$ by randomly
adding or removing edges from the original edge set. Then, by feeding the original
and transformed graph topology and the node feature matrix into any GNN-based
encoder, the feature representations $Z_{GNN}$ and $\tilde{Z}_{GNN}$ before and after the topology
transformation are calculated, and their difference $\Delta Z \in \mathbb{R}^{N \times F'}$ is defined as:

$$\Delta Z = \tilde{Z}_{GNN} - Z_{GNN} = [\Delta z_{GNN,1}, \cdots, \Delta z_{GNN,N}]^\top = [\tilde{z}_{GNN,1} - z_{GNN,1}, \cdots, \tilde{z}_{GNN,N} - z_{GNN,N}]^\top. \tag{18.38}$$

Next, they predict the topology transformation between nodes $v_i$ and $v_j$ through the
node-wise feature difference $\Delta Z$ by constructing the edge representation as:

$$e_{ij} = \frac{\exp(-(\Delta z_i - \Delta z_j) \odot (\Delta z_i - \Delta z_j))}{\|\exp(-(\Delta z_i - \Delta z_j) \odot (\Delta z_i - \Delta z_j))\|_1}, \tag{18.39}$$


where $\odot$ denotes the Hadamard product. This edge representation $e_{ij}$ is then fed
into an MLP for the prediction of the topological transformation, which includes
four classes: edge addition, edge deletion, keeping disconnection and keeping
connection between each pair of nodes. Thus, the GNN-based encoder is trained by:


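$$\mathcal{L}_{ssl} = \frac{1}{|\mathcal{V}|^2} \sum_{v_i, v_j \in \mathcal{V}} \ell_{CE}(\text{MLP}(e_{ij}), t_{ij}), \tag{18.40}$$

where we denote the topological transformation category between nodes $v_i$ and $v_j$
as the one-hot encoding $t_{ij} \in \mathbb{R}^4$.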
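
The four transformation classes and the edge representation of Eq. (18.39) can be sketched as follows. The class ordering, the function names, and the use of the L1 norm for the normalization are assumptions of this sketch.

```python
import math

CLASSES = ("edge addition", "edge deletion", "keeping disconnection", "keeping connection")

def transformation_class(a_before, a_after):
    """Index into CLASSES for one node pair, from its adjacency entries before/after."""
    if a_before == 0 and a_after == 1:
        return 0
    if a_before == 1 and a_after == 0:
        return 1
    if a_before == 0 and a_after == 0:
        return 2
    return 3

def transformation_labels(A, A_tilde):
    n = len(A)
    return [[transformation_class(A[i][j], A_tilde[i][j]) for j in range(n)]
            for i in range(n)]

def edge_representation(dz_i, dz_j):
    """exp(-(dz_i - dz_j) * (dz_i - dz_j)) element-wise, normalized to sum to one."""
    unnorm = [math.exp(-(a - b) ** 2) for a, b in zip(dz_i, dz_j)]
    norm = sum(unnorm)  # L1 norm, since every entry is positive
    return [u / norm for u in unnorm]

A = [[0, 1, 0], [1, 0, 0], [0, 0, 0]]
A_tilde = [[0, 0, 1], [0, 0, 0], [1, 0, 0]]  # edge (0, 1) deleted, edge (0, 2) added
labels = transformation_labels(A, A_tilde)
e_01 = edge_representation([0.3, -0.2], [0.1, 0.4])
```

The label matrix provides the targets $t_{ij}$, and `edge_representation` provides the MLP input for each node pair.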


18.5.2 Feature-based Pretext Tasks 


Typically, graphs do not come with any feature information and here the graph-level 
features refer to the graph embeddings obtained after applying a pooling layer on 
all node embeddings from GNNs. 

GraphCL (You et al, 2020b) designs a pretext task that first augments graphs
by four different augmentations, namely node dropping, edge perturbation, attribute
masking and subgraph extraction, and then maximizes the mutual information
of the graph embeddings between different augmented views generated from the
same original graph while also minimizing the mutual information of the graph
embeddings between different augmented views generated from different graphs. The
graph embeddings $Z_{ssl}$ are obtained through any permutation-invariant READOUT
function on node embeddings followed by an adaptation layer. Then
the mutual information is maximized by optimizing the following NT-Xent
contrastive loss:


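$$\mathcal{L}_{ssl} = \frac{1}{|\mathcal{P}^+|} \sum_{(\mathcal{G}_i, \mathcal{G}_j) \in \mathcal{P}^+} \ell_{NT\text{-}Xent}(Z^1_{ssl}, Z^2_{ssl}, \mathcal{P}^-), \tag{18.41}$$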


where $Z^1_{ssl}, Z^2_{ssl}$ represent graph embeddings under two different views. The view
could be the original view without any augmentation or one generated by applying
the four different augmentations. $\mathcal{P}^+$ contains positive pairs of graphs $(\mathcal{G}_i, \mathcal{G}_j)$
augmented from the same original graph, while $\mathcal{P}^- = \bigcup_{(\mathcal{G}_i, \mathcal{G}_j) \in \mathcal{P}^+} \mathcal{P}^-_{\mathcal{G}_i}$ represents
all sets of negative samples. Specifically, $\mathcal{P}^-_{\mathcal{G}_i}$ contains graphs augmented from
graphs different from $\mathcal{G}_i$. Numerical results demonstrate that edge perturbation
benefits social networks but hurts biochemical molecules. Attribute masking achieves
better performance in denser graphs. Node dropping and subgraph extraction are
generally beneficial across all datasets.




Fig. 18.6: An example of a context and r-neighborhood graph. (Figure legend: r-hop
neighborhood, context graph, center node, context anchor node.)


18.5.3 Hybrid Pretext Tasks 


One way to use the information of the training nodes in designing pretext tasks is
developed in (Hu et al, 2020c), where the concept of context is introduced. The goal of
this work is to pre-train a GNN so that it maps nodes appearing in similar graph
structure contexts to nearby embeddings. For every node $v_i$, the $r$-hop neighborhood of
$v_i$ contains all nodes and edges that are at most $r$ hops away from $v_i$ in the graph. The
context graph of $v_i$ is the subgraph between $r_1$ hops and $r_2$ hops away from node $v_i$.
It is required that $r_1 < r$ so that some nodes are shared between the neighborhood
and the context graph, which are referred to as context anchor nodes. Examples of
neighborhood and context graphs are shown in Fig. 18.6. Two GNN encoders are set
up: the main GNN encoder produces the node embedding $z^r_{GNN,i}$ based on the node
features in the $r$-hop neighborhood, and the context GNN produces the node embeddings
of every other node in the context anchor node set, which are then averaged to
get the node context embedding $c_i$. Then Hu et al (2020c) use negative sampling
to jointly learn the main GNN and the context GNN. In the optimization process,
positive samples refer to the situation when the center node of the context and the
neighborhood graphs is the same, while negative samples refer to the situation
when the center nodes of the context and the neighborhood graphs are different. The
learning objective is a binary classification of whether a particular neighborhood and
a particular context graph have the same center node, and the negative log-likelihood
loss is used as follows:


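$$\mathcal{L}_{ssl} = -\left( \frac{1}{|\mathcal{K}|} \sum_{(v_i, v_j) \in \mathcal{K}} \left( y_i \log(\sigma((z^r_{GNN,i})^\top c_j)) + (1 - y_i) \log(1 - \sigma((z^r_{GNN,i})^\top c_j)) \right) \right), \tag{18.42}$$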
where $y_i = 1$ for the positive sample where $i = j$ while $y_i = 0$ for the negative sample
where $i \neq j$, with $\mathcal{K}$ denoting the set of positive and negative pairs, and $\sigma$ is the
sigmoid function computing the probability.

A similar idea of employing the context concept in completing pretext tasks is also
proposed in (Jin et al, 2020d). Specifically, the context here is defined as:


$$y_{ic} = \frac{|\Gamma_{\mathcal{V}_u}(v_i, c)| + |\Gamma_{\mathcal{V}_l}(v_i, c)|}{|\Gamma_{\mathcal{V}_u}(v_i)| + |\Gamma_{\mathcal{V}_l}(v_i)|}, \quad c = 1, \ldots, l, \tag{18.43}$$

where $\mathcal{V}_u$ and $\mathcal{V}_l$ denote the unlabeled and labeled node sets, $\Gamma_{\mathcal{V}_u}(v_i)$ denotes the
unlabeled nodes that are adjacent to node $v_i$, $\Gamma_{\mathcal{V}_u}(v_i, c)$ denotes the unlabeled nodes
that have been assigned class $c$ and are adjacent to node $v_i$, $\Gamma_{\mathcal{V}_l}(v_i)$ denotes the
labeled nodes that are adjacent to node $v_i$, and $\Gamma_{\mathcal{V}_l}(v_i, c)$ denotes the labeled nodes
that are adjacent to node $v_i$ and of class $c$. To generate labels for the unlabeled nodes
so as to calculate the context vector $y_i$ for each node $v_i$, label propagation (LP) (Zhu,
2002) or the iterative classification algorithm (ICA) (Neville and Jensen, 2000) is
used to construct pseudo labels for the unlabeled nodes in $\mathcal{V}_u$. Then the pretext task is
approached by optimizing the following loss function:

$$\mathcal{L}_{ssl} = \frac{1}{|\mathcal{V}|} \sum_{v_i \in \mathcal{V}} \ell_{CE}(z_{ssl,i}, y_i). \tag{18.44}$$
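
Once every unlabeled neighbor has a pseudo label, the context vector of Eq. (18.43) is just the class distribution over a node's neighbors, as the sketch below illustrates. The function name and the dictionary-based label lookup are hypothetical, and the pseudo labels are assumed to have been produced beforehand (for example by LP or ICA).

```python
def context_vector(neighbors, labels, num_classes):
    """Fraction of a node's neighbors in each class.

    `labels[j]` holds the true label for labeled neighbors and the pseudo label
    for unlabeled ones, so both terms of the numerator are counted together.
    """
    counts = [0] * num_classes
    for j in neighbors:
        counts[labels[j]] += 1
    total = len(neighbors)
    return [c / total for c in counts] if total else [0.0] * num_classes

# Node 0 has neighbors 1, 2, 3; nodes 1 and 3 are labeled, node 2 carries a pseudo label.
labels = {1: 0, 2: 0, 3: 1}
y_0 = context_vector([1, 2, 3], labels, num_classes=2)
```

Errors in the pseudo labels propagate directly into the counts, which is the weakness that motivates the corrections discussed next.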


The main issue of the above pretext task is the error caused by generating la- 
bels from LP or ICA. The paper (Jin et al, 2020d) further proposed two methods 
to improve the above pretext task. The first method is to replace the procedure of 
assigning labels of unlabeled nodes based on only one method such as LP or ICA 
with assigning labels by ensembling results from multiple different methods. Their 
second method treats the initial labeling from LP or ICA as noisy labels, and then 
leverages an iterative approach (Han et al, 2019) to improve the context vectors, 
which leads to significant improvements based on this correction phase. 

One previous pretext task is to recover the topological distance between nodes.
However, calculating the shortest path distance for all pairs of nodes, even
after sampling, is time-consuming. Therefore, Jin et al (2020d) replace the
pairwise distance between nodes with the distance between nodes and their
corresponding clusters. For each cluster, a fixed set of anchor/center nodes is
established. For each node, its distance to this set of anchor nodes is calculated.
The pretext task is to extract node features that encode the information of this
node2cluster distance. Suppose $k$ clusters are obtained by applying the METIS graph
partitioning algorithm (Karypis and Kumar, 1998) and the node with the highest degree
is assumed to be the center of the corresponding cluster; then each node $v_i$ will have
a cluster distance vector $d_i \in \mathbb{R}^k$, and the distance-to-cluster pretext task is
completed by optimizing:


$$\mathcal{L}_{ssl} = \frac{1}{|\mathcal{V}|} \sum_{v_i \in \mathcal{V}} \ell_{MSE}(z_{ssl,i}, d_i). \tag{18.45}$$
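
The regression targets $d_i$ can be computed with one breadth-first search per cluster center, as sketched below. The partition is assumed to be given (for example by a graph partitioner such as METIS, which is not called here), the function names are hypothetical, and ties between equally high-degree nodes are broken by the smaller node id as an arbitrary choice.

```python
from collections import deque

def hop_distances(adj, source):
    """Shortest-path hop counts from `source` to every reachable node (BFS)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def cluster_distance_vectors(adj, clusters):
    """Return the anchor of each cluster and, per node, its distance to every anchor."""
    # Anchor = highest-degree node of the cluster (ties: smaller node id).
    anchors = [max(c, key=lambda v: (len(adj[v]), -v)) for c in clusters]
    dists = [hop_distances(adj, a) for a in anchors]
    vectors = {v: [d.get(v, float("inf")) for d in dists] for v in adj}
    return anchors, vectors

# Path graph 0 - 1 - 2 - 3 - 4 - 5 split into two clusters.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
anchors, d = cluster_distance_vectors(adj, [[0, 1, 2], [3, 4, 5]])
```

Only $k$ searches are needed instead of one per node, which is the saving over the all-pairs distance task.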

Aside from the graph topology and the node features, the distribution of the
training nodes and their training labels is another valuable source of information for
designing pretext tasks. One of the pretext tasks in (Jin et al, 2020d) is to require
the node embeddings output by GNNs to encode the information of the topological
distance from any node to the training nodes. Assuming that the total number of
classes is $p$, for each class $c \in \{1, \ldots, p\}$ and node $v_i \in \mathcal{V}$, the average, minimum
and maximum shortest path lengths from $v_i$ to all labeled nodes in class $c$ are
calculated and collected into $d_i \in \mathbb{R}^{3p}$. The objective is then to optimize the same
regression loss as defined in Eq. (18.45).

The generating process of networks encodes abundant information for designing
pretext tasks. Hu et al (2020d) propose the GPT-GNN framework for generative
pre-training of GNNs. This framework performs attribute and edge generation to
enable the pre-trained model to capture the inherent dependency between node
attributes and graph structure. Assuming that the likelihood over this graph by this
GNN model is $p(\mathcal{G}; \theta)$, which represents how the nodes in $\mathcal{G}$ are attributed and
connected, GPT-GNN aims to pre-train the GNN model by maximizing the graph
likelihood, i.e., $\theta^* = \max_\theta p(\mathcal{G}; \theta)$. Given a permuted order, the log likelihood is
factorized autoregressively, generating one node per iteration, as:


$$\log p_\theta(X, \mathcal{E}) = \sum_{i=1}^{|\mathcal{V}|} \log p_\theta(x_i, \mathcal{E}_i \mid X_{<i}, \mathcal{E}_{<i}). \tag{18.46}$$


For all nodes that are generated before the node $i$, their attributes $X_{<i}$ and the edges
between these nodes $\mathcal{E}_{<i}$ are used to generate a new node $v_i$, including both its
attribute $x_i$ and its connections with existing nodes $\mathcal{E}_i$. Instead of directly assuming
that $x_i, \mathcal{E}_i$ are independent, they devise a dependency-aware factorization
mechanism to maintain the dependency between node attributes and edge existence. The
generation process can be decomposed into two coupled parts: (1) generating node
attributes given the observed edges, and (2) generating the remaining edges given
the observed edges and the generated node attributes. For computing the loss of
attribute generation, the generated node feature matrix $X$ is corrupted by masking
some dimensions to obtain the corrupted version $\tilde{X}^{Attr}$, which is further fed together
with the generated edges into GNNs to get the embeddings $\tilde{Z}^{Attr}_{GNN}$. Then, the decoder
$\text{Dec}^{Attr}(\cdot)$ is specified, which takes $\tilde{Z}^{Attr}_{GNN}$ as input and outputs the predicted
attributes $\text{Dec}^{Attr}(\tilde{Z}^{Attr}_{GNN})$. The attribute generation loss is:


$$\mathcal{L}^{Attr}_{ssl} = \frac{1}{|\mathcal{V}|} \sum_{v_i \in \mathcal{V}} \ell_{MSE}(\text{Dec}^{Attr}(\tilde{z}^{Attr}_{GNN,i}), x_i), \tag{18.47}$$

where $\tilde{z}^{Attr}_{GNN,i} = \tilde{Z}^{Attr}_{GNN}[i,:]^\top$ denotes the decoded embedding of node $v_i$. For
computing the loss of edge reconstruction, the original generated node feature matrix $X$
is directly fed together with the generated edges into GNNs to get the embeddings
$\tilde{Z}^{Edge}_{GNN}$. Then the contrastive NT-Xent loss is calculated:


$$\mathcal{L}^{Edge}_{ssl} = \frac{1}{|\mathcal{P}^+|} \sum_{(v_i, v_j) \in \mathcal{P}^+} \ell_{NT\text{-}Xent}(Z^1, Z^2, \mathcal{P}^-), \tag{18.48}$$

where $\mathcal{P}^+$ contains positive pairs of connected nodes $(v_i, v_j)$, while
$\mathcal{P}^- = \bigcup_{(v_i, v_j) \in \mathcal{P}^+} \mathcal{P}^-_i$ represents all sets of negative samples and $\mathcal{P}^-_i$ contains
all nodes that are not directly linked with node $v_i$. Note that here the two views are
set equal, i.e., $Z^1 = Z^2 = \tilde{Z}^{Edge}_{GNN}$.


18.6 Node-graph-level SSL Pretext Tasks 


All the above pretext tasks are designed based on either node-level or graph-level
supervision. However, there is another line of research combining these two
sources of supervision to design pretext tasks, which we summarize in this section.

Veličković et al (2019) proposed to maximize the mutual information between
representations of high-level graphs and low-level patches. In each iteration, a
negative sample $(\tilde{X}, \tilde{A})$ is generated by corrupting the graph through shuffling node
features and removing edges. Then a GNN-based encoder is applied to extract node
representations $Z_{GNN}$ and $\tilde{Z}_{GNN}$, which are also named the local patch representations.
The local patch representations are further fed into an injective readout function to
get the global graph representation $z_{GNN,\mathcal{G}} = \text{READOUT}(Z_{GNN})$. Then the mutual
information between $Z_{GNN}$ and $z_{GNN,\mathcal{G}}$ is maximized by maximizing the following
objective:


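$$\mathcal{L}_{ssl} = \frac{1}{|\mathcal{P}^+| + |\mathcal{P}^-|} \left( \sum_{i=1}^{|\mathcal{P}^+|} \mathbb{E}_{(X, A)} \left[ \log \sigma(z_{GNN,i}^\top W z_{GNN,\mathcal{G}}) \right] + \sum_{j=1}^{|\mathcal{P}^-|} \mathbb{E}_{(\tilde{X}, \tilde{A})} \left[ \log(1 - \sigma(\tilde{z}_{GNN,j}^\top W z_{GNN,\mathcal{G}})) \right] \right) \tag{18.49}$$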


where $|\mathcal{P}^+|$ and $|\mathcal{P}^-|$ are the numbers of positive and negative pairs, $\sigma$ stands
for any nonlinear activation function (PReLU is used in (Veličković et al, 2019)), and
$z_{GNN,i}^\top W z_{GNN,\mathcal{G}}$ calculates the weighted similarity between the patch representation
centered at node $v_i$ and the graph representation. A linear classifier is then trained
to classify nodes after the above contrastive pretext task.

Similar to (Veličković et al, 2019), where the mutual information between the
patch representations and the graph representations is maximized, Hassani and
Khasahmadi (2020) proposed another framework that contrasts the node representations
of one view with the graph representations of another view. The first view is
the original graph and the second view is generated by a graph diffusion matrix. The
heat and personalized PageRank (PPR) diffusion matrices are considered, which are:


$$S^{heat} = \exp(t A D^{-1} - t), \tag{18.50}$$

$$S^{PPR} = \alpha (I_n - (1 - \alpha) D^{-1/2} A D^{-1/2})^{-1}, \tag{18.51}$$


where $\alpha$ denotes the teleport probability, $t$ is the diffusion time, and $D$ is the
diagonal degree matrix. After the diffusion matrix $S$ is obtained, two different GNN
encoders followed by a shared projection head are applied to the nodes in the original
graph adjacency matrix and the generated diffusion matrix to get two different node
embeddings $Z^1_{GNN}$ and $Z^2_{GNN}$. Two different graph embeddings $z^1_{GNN,\mathcal{G}}$ and $z^2_{GNN,\mathcal{G}}$ are
further obtained by applying a graph pooling function to the node representations
(before the projection head), followed by another shared projection head. The mutual
information between nodes and graphs in different views is maximized through:


1 
Les = “WH y (MI(26nn,iZGNN9) +MI(zénnisZGnn.g))s (18.52) 
viEV 


where MI represents the mutual information estimator. Four estimators are explored:
the noise-contrastive estimator, the Jensen-Shannon estimator, normalized
temperature-scaled cross-entropy, and the Donsker-Varadhan representation of the
KL-divergence. Note that the mutual information in Eq. 18.52 is averaged over all
graphs in the original work (Hassani and Khasahmadi, 2020). Additionally, their
results demonstrate that the Jensen-Shannon estimator achieves better results across
all graph classification tasks, whereas in the node classification task, noise
contrastive estimation achieves better results. They also discover that increasing the
number of views does not increase the performance on downstream tasks.


18.7 Discussion 


Existing methods employing self-supervision to graph neural networks achieve per- 
formance improvements and numerous insightful results are also discovered in the 
meantime. While most of the self-supervised pretext tasks are helpful for the down- 
stream tasks, there are still a fair proportion of pretext tasks that bring weak im- 
provement or even fail to boost the performance (Gao et al, 2021; Jin et al, 2020d; 
Manessi and Rozza, 2020; You et al, 2020c). This is either because these pretext 
tasks are highly unrelated to the primary task, i.e., the encoded features useful for 
pretext tasks are useless or even harmful (Manessi and Rozza, 2020) for down- 
stream tasks or because the information learned from completing pretext tasks can 
already be learned from completing downstream tasks by GNNs (Jin et al, 2020d). 
Besides, the strength of the performance improvement depends on the specific GNN 
architecture used for completing pretext and downstream tasks. The improvements 
are more significant for basic GNNs such as GCN, GAT, and GIN while less for 
more advanced GNNs such as GMNN (You et al, 2020c). Furthermore, one pretext 
task is not universally the best across multiple datasets (Gao et al, 2021; Manessi 
and Rozza, 2020). Therefore, whether a self-supervised pretext task helps GNNs in
the standard target performance is determined first by whether the dataset allows
the GNNs to extract extra feature information through completing pretext tasks,
and second by whether the extra self-supervised information complements, contradicts,
or has already been covered by the information extracted by the existing architecture
(You et al, 2020c). Numerous works focus on applying contrastive learning
as a form of self-supervised learning (Chen et al, 2020b; Hassani and Khasahmadi,
2020; Veličković et al, 2019; You et al, 2020b; Zhu et al, 2021). Generally they find
that while composing different augmentations benefits the performance (You et al,
2020b), increasing the number of views generated from the same graph augmentation
technique to more than two causes no further improvement (Hassani and Khasahmadi,
2020), which is different from visual representation learning. Moreover, the
beneficial combinations of augmentations are data-specific because of the highly 
heterogeneous nature of the graph-structured data and harder contrastive tasks are 
more helpful than overly simple ones (You et al, 2020b). Therefore, designing viable 
pretext tasks requires domain specific knowledge and should be targeted towards 
specific types of networks, GNN architectures and downstream tasks. 


18.8 Summary 


In this chapter, we provided a systematic, categorical and comprehensive overview on
the recent works leveraging self-supervised learning in graph neural networks. De- 
spite recent successes achieved by applying self-supervised learning in the text and 
image domains, self-supervised learning applied to the graph domain, especially 
for graph neural networks, is still in its emerging stage. Several promising direc- 
tions could be pursued to further advance this field. First, although a large surge of 
research focuses on designing effective pretext tasks boosting the performance of 
graph neural networks, few works focus on visualizing, interpreting and explaining 
the underlying reason causing such beneficial performance improvements. Deeply 
understanding the intrinsic mechanism as to why and how SSL helps GNNs could 
help us design more powerful pretext tasks. Second, similar to the work defining 
the architectural design space for GNNs to quickly query the best GNN design 
for a novel task on a novel dataset (You et al, 2020a), we should collect and clas- 
sify various pretext tasks and create a design space for SSL in GNNs. This allows 
for transferring the best designs of pretext tasks across different downstream tasks, 
GNN architectures and datasets. We hope that this chapter can shed some light on 
the main ideas of applying self-supervised learning to graph neural networks and 
related applications in order to encourage progress in the field. 




Yu Wang, Wei Jin, and Tyler Derr 


Editor’s Notes: Although the methods introduced in the previous chapters
(chapters 4, 5, 6, 15, and 16) have achieved state-of-the-art performance in
their corresponding tasks, they require large annotated datasets. Self-supervised
learning seeks to create and utilize pretext labels on unlabeled data. Pretext
tasks are related to traditional graph analysis tasks, such as node-level
tasks (chapter 4) and graph-level tasks (chapter 9), except that pretext tasks use
pseudo labels. The development of self-supervised GNNs is of great significance
to domains where labeled data are difficult to obtain, such as drug
development (chapter 24). Besides, domains that have accumulated a large
number of unlabeled datasets, such as computer vision (chapter 20) and
natural language processing (chapter 21), also benefit from self-supervised
learning.


Abstract Graph is an expressive and powerful data structure that is widely applicable,
due to its flexibility and effectiveness in modeling and representing graph-structured
data. It has become more and more popular in various fields, including biology,
finance, transportation, and social networks, among many others. Recommender systems,
one of the most successful commercial applications of artificial intelligence,
whose user-item interactions naturally fit into graph-structured data, have also
received much attention in applying graph neural networks (GNNs). We first summarize
the most recent advancements of GNNs, especially in recommender systems. Then
we share two case studies, dynamic GNN learning and device-cloud collaborative
learning for GNNs. We conclude with discussions regarding the future directions
of GNNs in practice.


19.1 Graph Neural Networks for Recommender System in 
Practice 


19.1.1 Introduction 


The Introduction of GNNs


Graphs have a long history, originating from the Seven Bridges of Königsberg problem
in 1736 (Biggs et al, 1986). A graph is flexible enough to model complex relationships
among individuals, which makes it a ubiquitous data structure widely applied in
numerous fields, e.g., biology, finance, transportation, social networks, and
recommender systems.

Although graph theory has traditional topics that extract deterministic information,
such as shortest paths, connected components, local clustering, and graph isomorphism,
machine learning applications for graph data focus more on
predicting the missing parts or future dynamics. Among these applications, the most
typical research problems studied in recent years are predicting whether there exists
or will emerge an edge between two nodes (link prediction), and inferring node-level
or graph-level labels (node/graph classification).

The recent progress in deep learning leads to a booming learning paradigm called 
representation learning, which also becomes the de facto standard in solving graph 
machine learning problems. The idea of graph representation learning is to encode 
graph primitives as real-valued vectors in the same metric space, which are then 
used in downstream applications. The encoder takes as input the original graph,
such as the node attribute vectors and the graph adjacency matrix, in an end-to-end
fashion, unlike traditional methods that require extracting heuristic features such as
betweenness centrality, PageRank value, or the number of closed triangles.

Next, we summarize recent graph node representation techniques in a unified
framework and focus only on the link prediction task. We illustrate several
representative approaches in recent literature from a node-centric perspective, since
the node-centric view naturally fits into scalable message passing implementations
that were originally popular in the graph mining community (Malewicz et al, 2010; Y.Low
et al, 2012) and were then adopted by the GNN community (Wang et al, 2019f; Zhu et al,
2019c).

For a graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ with adjacency matrix $A$, a
standard graph neural network model has the following components.


• An ego-network extractor EGO that extracts a local subgraph around the node
$v$. This local subgraph is also referred to as the receptive field of $v$, which is
then used by the node encoder.

• An encoder ENC that maps each node $v \in \mathcal{V}$ into a vector in a metric
space $\mathbb{R}^d$. The encoder takes as input the ego-network of $v$, as well as any
node/edge representation in EGO($v$). A similarity function is defined on
$\mathbb{R}^d$ to measure how close two nodes appear to be.

• A learning objective $\mathcal{L}$. We do not discuss node classification here and
only focus on unsupervised node representation learning. The objective can be
reconstructing the adjacency matrix $A$, transformations of $A$, or any sampled form
of $A$ and its transformations.


Random Walk-style 


Early graph representation learning approaches (Perozzi et al, 2014; Tang et al,
2015b; Cao et al, 2015; Zhou et al, 2017; Ou et al, 2016; Grover and Leskovec,
2016) in the deep learning era are inspired by word2vec (Mikolov et al, 2013b), an
efficient word embedding method in the natural language processing community. These
methods do not need any neighborhood for encoding, so EGO acts as an identity
mapping. The encoder ENC takes the node id in the graph as input and assigns a
trainable vector to each node.

What differs most among these methods is the learning objective. Approaches
like DeepWalk, LINE, and node2vec use different random walk strategies to create
positive node pairs $(u,v)$ as training examples, and estimate the probability of
visiting $v$ given $u$, $p(v|u)$, as a multinomial distribution,


$$
p(v|u) = \frac{\exp(\mathrm{sim}(u,v))}{\sum_{v'} \exp(\mathrm{sim}(u,v'))},
$$


where sim is a similarity function. To reduce the high computation cost, they exploit
an approximated Noise Contrastive Estimation (NCE) loss (Gutmann and Hyvärinen, 2010),
known as skip-gram with negative sampling and originating in word2vec:


$$
\log \sigma(\mathrm{sim}(u,v)) + k \, \mathbb{E}_{v' \sim P_{neg}} \log(1 - \sigma(\mathrm{sim}(u,v'))).
$$


$P_{neg}$ is a proposal distribution for negatives, which impacts the variation of the
optimization target (Yang et al, 2020d). Note that this formula can also be
approximated with sampled softmax (Bengio and Senécal, 2008; Jean et al, 2014), which
in our experience performs better in top-k recommendation tasks as the number of nodes
becomes extremely large (Zhou et al, 2020a).
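
The skip-gram objective with negative sampling can be sketched as follows. This is an illustrative sketch only: it assumes a dot-product similarity, replaces the expectation with a sum over k sampled negatives, and uses a uniform proposal distribution. The embedding table and helper names are hypothetical.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def sgns_objective(emb, u, v, negatives):
    """log sigma(sim(u,v)) plus log(1 - sigma(sim(u,v'))) for each sampled v'."""
    value = math.log(sigmoid(dot(emb[u], emb[v])))
    for v_neg in negatives:
        value += math.log(1.0 - sigmoid(dot(emb[u], emb[v_neg])))
    return value


def sample_negatives(nodes, exclude, k, rng):
    """Assumed proposal: uniform over nodes outside the positive pair."""
    pool = [n for n in nodes if n not in exclude]
    return [rng.choice(pool) for _ in range(k)]


rng = random.Random(0)
emb = {
    "a": [1.0, 0.0],
    "b": [0.9, 0.1],
    "c": [-1.0, 0.0],
    "d": [-0.8, -0.2],
}
negatives = sample_negatives(list(emb), exclude={"a", "b"}, k=3, rng=rng)
objective = sgns_objective(emb, "a", "b", negatives)
```

The objective is maximized during training. A pair with aligned embeddings, such as ("a", "b") above, scores higher than a pair with opposed embeddings.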

These learning objectives have connections with traditional node proximity
measurements in the graph mining community. GraRep (Cao et al, 2015) and APP (Zhou
et al, 2017) borrow the idea from (Levy and Goldberg, 2014) and point out that these
random walk based methods are equivalent to preserving their corresponding
transformations of the adjacency matrix $A$, such as personalized PageRank.


Matrix Factorization-style 


HOPE (Ou et al, 2016) provides a generalized matrix form of other types of node
proximity measurements, e.g., Katz and Adamic-Adar, and adopts matrix factorization to
learn embeddings that preserve these proximities. NetMF (Qiu et al, 2018) unifies
several classic graph embedding methods in the framework of matrix factorization and
provides connections between the DeepWalk-like approaches and the theory of graph
Laplacian.


GNN-style 


Graph neural networks (Kipf and Welling, 2017b; Scarselli et al, 2008) provide
an end-to-end semi-supervised learning paradigm for problems that were previously
modeled via label propagation. They can also be used to learn node representations in
an unsupervised manner like the above graph embedding methods. GNN-like approaches
for unsupervised learning, compared to DeepWalk-like methods, are more powerful
in capturing local structure, although their discriminative power is at most that of
the WL test (Xu et al, 2019d). Downstream link prediction tasks that require
local-structure-aware representations or cooperation with node features may benefit
more from GNN-style approaches.

The EGO operator collects and constructs the receptive field of each node. For
GCN (Kipf and Welling, 2017b), a full k-layer neighborhood is required for each
node, making it hard to apply to large graphs, which usually follow a power-law degree
distribution. GraphSage (Hamilton et al, 2017b) instead samples a fixed-size
neighborhood in each layer, which mitigates this problem and scales to large graphs.
LCGNN (Qiu et al, 2021) samples a local cluster around each node by short random
walks with a theoretical guarantee.

Then different kinds of aggregation functions are proposed within this receptive
field. GraphSage investigates several neighborhood aggregation alternatives, including
mean/max pooling and LSTM. GAT (Veličković et al, 2018) utilizes self-attention
to perform the aggregation, which shows stable and superior performance on many
graph benchmarks. GIN (Xu et al, 2019d) has a slightly different aggregation function,
whose discriminative/representational power is proved to be equal to the power
of the WL test. As the link prediction task may also consider structural similarity
between two nodes besides their distance, this local-structure-preserving method may
achieve good performance for networks that have obvious local structural patterns.
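
The interplay of the EGO operator and the aggregation function can be illustrated with a small sketch of fixed-size neighbor sampling followed by mean aggregation, in the spirit of GraphSage. The learned linear transformation and nonlinearity applied after aggregation are omitted, and the function names and toy graph are hypothetical.

```python
import random


def sample_neighbors(adj, node, size, rng):
    """EGO step: draw a fixed-size neighbor sample (with replacement if needed)."""
    neighbors = adj[node]
    if not neighbors:
        return []
    if len(neighbors) >= size:
        return rng.sample(neighbors, size)
    return [rng.choice(neighbors) for _ in range(size)]


def mean_aggregate(features, node, sampled):
    """Aggregation step: concatenate self features with the neighbor mean."""
    dim = len(features[node])
    if sampled:
        mean = [sum(features[n][d] for n in sampled) / len(sampled) for d in range(dim)]
    else:
        mean = [0.0] * dim
    return features[node] + mean


rng = random.Random(0)
adj = {0: [1, 2, 3, 4], 1: [0], 2: [0], 3: [0], 4: [0]}  # a small star graph
features = {n: [float(n), 1.0] for n in adj}
sampled = sample_neighbors(adj, 0, size=2, rng=rng)
hidden = mean_aggregate(features, 0, sampled)
```

Because the sample size is fixed, the cost per node does not grow with the degree of hub nodes, which is the scalability argument made above.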

The learning objectives of GNN-style approaches are similar to those of the random
walk style ones.


Introduction of Modern Recommender System 


Recommender systems, one of the most successful commercial applications of
artificial intelligence, whose user-item interactions naturally fit into
graph-structured data, have also received much attention in applying GNNs. We now give
a brief introduction to the problem settings and the classic methods in recommender
systems.

The user-item relationships are the most typical form of recommender systems, 
e.g., news recommendation, e-commerce recommendation, video recommendation. 
Although recommender systems are eventually optimizing for a complex ecosystem 
of multi-sided participants (Abdollahpouri et al, 2020), i.e., the users, the platform 
and the content provider, we only focus on how the platform will maximize the 
user-side utility in this chapter. 

In a user-item recommender system $\mathcal{S}$ with recommender algorithm
$\mathcal{A}$, $\mathcal{U}$ is the user set and $\mathcal{I}$ is the item set. At
timestamp $t$, a user $u \in \mathcal{U}$ visits $\mathcal{S}$, and a list of items
$\mathcal{I}_{u,t}$ is produced by $\mathcal{A}$. $u$ takes positive actions, e.g.,
click, buy, play, on parts of the items in $\mathcal{I}_{u,t}$, referred to as
$\mathcal{I}^{+}_{u,t}$, while performing the corresponding negative actions on the
others, e.g., not click, not buy, not play, referred to as $\mathcal{I}^{-}_{u,t}$.

The basic data collected from an industrial recommender system can be described as


$$
\mathcal{D}_{\mathcal{A}} = \{ (t, \mathcal{I}^{+}_{u,t}, \mathcal{I}^{-}_{u,t}) \mid u \in \mathcal{U}, t \}. \tag{19.1}
$$

The short-term objective¹ of an algorithm in modern recommender systems can be
summarized as

$$
\mathcal{A}^{*} = \arg\max_{\mathcal{A}} \sum_{u,t} \mathrm{Utility}(\mathcal{I}_{u,t}), \tag{19.2}
$$

in which the Utility function could be considered as maximizing click through rate, 
GMV, or a mixture of multiple objectives (Ribeiro et al, 2014; McNee et al, 2006). 

A modern commercial recommender system, especially one with millions of end-users
and items, adopts a multi-stage modeling pipeline as the
tradeoff between the business goals and the efficiency given the constraints of limited
computing resources. Different stages have different simplifications of the data
organization and objectives, which many research papers do not state clearly.

In the following, we first review several simplifications of the industrial recom- 
mendation problem setting, that are clean enough for the research community. Then 
we describe the multi-stage pipeline and the problem in each stage, review clas- 
sic methods to handle the problem and revisit how GNNs are applied in existing 
methods, trying to give an objective view about these methods. 


Simplifications of the collected data. 


• Impression bias. The user feedback data generated under algorithm $\mathcal{A}$ is
biased when used to estimate the oracle user preference. This critical problem, unique
to recommender systems, is usually not considered, especially in early works.

• Negative feedback. $|\mathcal{I}^{-}_{u,t}|$, the number of negative behaviors in one
display, is orders of magnitude larger than $|\mathcal{I}^{+}_{u,t}|$, and very few
datasets have collected negative feedback. Most of the well-known papers in the
research community ignore this true negative user feedback and instead simulate
negative feedback by sampling from a proposal distribution, which is not the ground
truth, so the metrics designed over the simulated feedback may not reveal true
performance.

• Temporal information. Early studies prefer a static view of recommendation,
which eliminates the temporal information $t$ in the user behavior sequences.


Multi-stage model pipeline in modern recommender systems. 


• Retrieval Phase. This phase is also referred to as the candidate generation or recall
phase. It narrows down the collection of relevant items from billions to hundreds
via efficient similarity-based learning, indexing, and searching. To avoid getting
stuck in dead loops caused by fitting the exposure distribution, the retrieval
phase has to independently provide sufficient diversity for different downstream
purposes or strategies, while retaining accuracy. As the candidate set is
extremely large, approaches in the recall phase usually take the form
of point-wise modeling, for which it is simple to build sophisticated indexes and
perform efficient retrieval. The most widely used measurement for this phase is the
top-k hit ratio.

• Rank Phase. The problem space is quite different from that of the retrieval
phase, since the rank phase needs to give precise comparisons within a much smaller
subspace, instead of recalling as many good items as possible from the entire item
candidate set. Restricted to a small number of candidates, it is capable of exploiting
more complex methods over the user-item interactions in acceptable response
time.

• Re-rank Phase. Considering the effects studied in the discrete choice model (Train,
1986), the relationships among the displayed items may have significant impacts
on user behavior. This poses opportunities to approach the problem from the
combinatorial optimization perspective, i.e., how to choose a combination of the subset
that maximizes the overall utility of the recommendation list.


¹ We indicate the short-term objective as the objective in the sense of each request response. Here
we do not consider further impacts on the ecosystem brought by an algorithm.


Table 19.1: Data simplifications in different settings

Setting / Phase in Pipeline                     Data Simplification
Matrix Completion / Retrieval Phase             $\mathcal{D}_{\mathcal{A}} = \{ \mathcal{I}^{+}_{u} \mid u \in \mathcal{U} \}$
Click Through Rate Prediction / Rank Phase      $\mathcal{D}_{\mathcal{A}} = \{ (\mathcal{I}^{+}_{u}, \mathcal{I}^{-}_{u}) \mid u \in \mathcal{U} \}$
Sequential Recommendation / Retrieval Phase     $\mathcal{D}_{\mathcal{A}} = \{ (t, \mathcal{I}^{+}_{u,t}) \mid u \in \mathcal{U}, t \}$


The above stages can be adjusted according to different characteristics of the
recommendation scenario. For example, if the candidate set contains only hundreds or
thousands of items, the recall phase is not necessarily required, as the computation
power is usually enough to rank all candidates at once. The re-rank phase is also not
necessary if the number of items per request is small.

We summarize in Table 19.1 the different data simplifications made in different 
problem settings with their corresponding pipeline stages. 
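
The three stages can be sketched as a toy pipeline. This is a simplified illustration under stated assumptions: retrieval uses dot-product scores, the rank stage uses a hypothetical table of predicted click-through rates in place of a heavier model, and the re-rank stage applies a greedy one-item-per-category rule as a stand-in for list-level optimization. None of these choices is prescribed by the text.

```python
def retrieve(user_vec, item_vecs, k):
    """Retrieval: point-wise dot-product scoring, keep the top-k item ids."""
    def score(item):
        return sum(a * b for a, b in zip(user_vec, item_vecs[item]))
    return sorted(item_vecs, key=score, reverse=True)[:k]


def rank(candidates, score_fn):
    """Rank: a heavier (hypothetical) model scores only the retrieved candidates."""
    return sorted(candidates, key=score_fn, reverse=True)


def rerank(ranked, category, slots):
    """Re-rank: greedy list-level rule, at most one item per category."""
    seen, result = set(), []
    for item in ranked:
        if category[item] not in seen:
            seen.add(category[item])
            result.append(item)
        if len(result) == slots:
            break
    return result


item_vecs = {"i1": [1.0, 0.0], "i2": [0.9, 0.1], "i3": [0.0, 1.0], "i4": [-1.0, 0.0]}
predicted_ctr = {"i1": 0.2, "i2": 0.5, "i3": 0.3, "i4": 0.9}  # hypothetical rank model
category = {"i1": "A", "i2": "A", "i3": "B", "i4": "C"}

candidates = retrieve([1.0, 0.2], item_vecs, k=3)
ranked = rank(candidates, predicted_ctr.get)
final_list = rerank(ranked, category, slots=2)
```

Note that item "i4" has the highest predicted click-through rate but never reaches the rank stage, which illustrates why the retrieval phase must preserve enough recall and diversity on its own.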


19.1.2 Classic Approaches to Predict User-Item Preference 


The fundamental ability required by Recommender System is to predict the possi- 
bility that a user will take actions on a specific displayed item, which we refer to as 
the point-wise preference estimation, p(item|user). Now we review several classic 
approaches in dealing with the cleanest setting of Matrix Completion in Table 19.1. 

From the user-item interaction matrix perspective, the data organization
$\mathcal{D}_{\mathcal{A}} = \{ \mathcal{I}^{+}_{u} \mid u \in \mathcal{U} \}$ is
$M = \{ M_{u,i} \mid u \in \mathcal{U}, i \in \mathcal{I} \}$, where each row
$M_u = \mathcal{I}^{+}_{u}$. The famous Collaborative Filtering (CF) methods in
recommendation can be categorized into neighborhood-based and model-based ones.


Neighborhood-based Approaches


Item-based collaborative filtering first identifies a set of similar items for each of the 
items that the user has clicked/purchased/rated, and then recommends top-N items 
by aggregating the similarities. User-based CF, on the other hand, identifies similar 
users and then performs aggregation on their clicked items. 

The key part in neighborhood-based approaches is the definition of the similarity
metric. Taking item-based CF as an example, top-k heuristic approaches calculate
item-item similarity from the user-item interaction matrix $M$, e.g., Pearson
correlation or cosine similarity. Storing $|\mathcal{I}| \times |\mathcal{I}|$
similarity scores is intractable. Instead, to help produce a top-k recommendation list
efficiently, neighborhood-based k-nearest-neighbor CF usually memorizes only the top
few similar items for each item, resulting in a sparse similarity matrix $C$. Beyond
heuristics, SLIM (Ning and Karypis, 2011) learns such a sparse similarity matrix by
reconstructing $M$ via $MC$ with zero-diagonal and sparsity constraints on $C$.

One drawback of storing only the sparse similarity is that it cannot identify
less-similar relationships, which restricts its downstream applications.
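
A minimal sketch of item-based CF with a sparse top-k similarity matrix is given below. It assumes cosine similarity over the columns of a binary interaction matrix; the function names and the toy matrix are hypothetical and serve only to illustrate the idea.

```python
import math


def cosine(a, b):
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den if den else 0.0


def topk_item_similarity(M, k):
    """Sparse similarity C: for each item keep only its k most similar items."""
    n_items = len(M[0])
    cols = [[row[i] for row in M] for i in range(n_items)]
    C = {}
    for i in range(n_items):
        sims = [(cosine(cols[i], cols[j]), j) for j in range(n_items) if j != i]
        sims.sort(key=lambda p: (-p[0], p[1]))
        C[i] = {j: s for s, j in sims[:k] if s > 0}
    return C


def recommend(M, C, user, n):
    """Aggregate similarities of the items the user already interacted with."""
    scores = {}
    for i, clicked in enumerate(M[user]):
        if clicked:
            for j, s in C[i].items():
                if not M[user][j]:
                    scores[j] = scores.get(j, 0.0) + s
    return sorted(scores, key=lambda j: (-scores[j], j))[:n]


# Rows are users, columns are items; 1 marks a positive interaction.
M = [
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
]
C = topk_item_similarity(M, k=2)
top_items = recommend(M, C, user=0, n=1)
```

In this toy example item 3 is absent from the stored neighbors of item 0, so no similarity between the two can be looked up later, which is the drawback noted above.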


Model-based Approaches 


Model-based methods learn similarity functions between users and items by optimizing
an objective function. Matrix Factorization is built on the prior that the
user-behavior matrix is low-rank, i.e., all users’ tastes can be described by linear
combinations of a few latent style factors. The prediction of a user’s preference for an
item can be calculated as the dot product of the corresponding user and item factors.


19.1.3 Item Recommendation in user-item Recommender Systems: 
a Bipartite Graph Perspective 


The matrix completion setting also has an equivalent form as a bipartite graph,


$$
\mathcal{G} = (\mathcal{V}, \mathcal{E}), \tag{19.3}
$$


where $\mathcal{V} = \mathcal{U} \cup \mathcal{I}$, i.e., the union of the user set
$\mathcal{U}$ and the item set $\mathcal{I}$, and
$\mathcal{E} = \{ (u,i) \mid i \in \mathcal{I}^{+}_{u}, u \in \mathcal{U} \}$, i.e., the
collection of the edges between $u$ and his/her clicked $i$. Then the point-wise
user-item preference estimation can be viewed as a link prediction task in this
user-item interaction bipartite graph.

Heuristic graph mining approaches, which fall into the category of neighborhood-based
CF, are widely used in the retrieval phase. We can calculate user-item similarity
by computing graph proximity measures like Common Neighbors, Adamic-Adar (Adamic and
Adar, 2003), Katz (Katz, 1953), and Personalized PageRank (Haveliwala, 2002) over
the original bipartite graph, or calculate item-item similarity on its induced
item-item correlation graph (Zhou et al, 2017; Wang et al, 2018b), which is then used
in the final user preference aggregation.
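
As one example of such a heuristic, Personalized PageRank from a user node can be approximated by power iteration on the bipartite graph. The sketch below is illustrative only: the teleport probability, iteration count, treatment of dangling nodes, and the "u"/"i" naming convention for node ids are all assumptions.

```python
def personalized_pagerank(adj, source, alpha=0.15, iters=50):
    """Power iteration: p <- alpha * e_source + (1 - alpha) * P^T p."""
    p = {n: 0.0 for n in adj}
    p[source] = 1.0
    for _ in range(iters):
        nxt = {n: 0.0 for n in adj}
        nxt[source] += alpha
        for n, mass in p.items():
            if not adj[n]:
                # Assumption: mass on dangling nodes returns to the source.
                nxt[source] += (1 - alpha) * mass
                continue
            share = (1 - alpha) * mass / len(adj[n])
            for m in adj[n]:
                nxt[m] += share
        p = nxt
    return p


# Undirected user-item bipartite graph stored as adjacency lists.
adj = {
    "u1": ["i1", "i2"],
    "u2": ["i2", "i3"],
    "i1": ["u1"],
    "i2": ["u1", "u2"],
    "i3": ["u2"],
}
scores = personalized_pagerank(adj, "u1")
# Items that u1 has not interacted with are the retrieval candidates.
candidates = [n for n in adj if n.startswith("i") and n not in adj["u1"]]
```

Item "i3" receives a positive score for "u1" through the shared item "i2" and the user "u2", even though "u1" never interacted with it.
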

Graph embedding techniques for industrial recommender systems are first explored
in (Zhou et al, 2017) and its successor with side information support (Wang
et al, 2018b). They construct an item correlation graph of billions of edges from
user-item click sequences organized by sessions. Then a DeepWalk-style graph
embedding method is applied to calculate the item representations, which then provide
item-item similarities in the retrieval phase. Though it is shown in (Zhou et al, 2017)
that embedding-based methods have an advantage in scenarios where the top-k heuristics
cannot provide any item-pair similarity, it is still debatable whether the similarity
given by graph embedding methods can outperform carefully designed heuristic
ones when all the top-k similar items can be retrieved.

We also note that graph embedding techniques can be regarded as matrix factorization
of a transformation of the graph adjacency matrix $A$, as discussed in earlier
sections. That means that, theoretically, the difference between graph embedding
techniques and basic matrix factorization lies in their priors, i.e., which matrix is
assumed to be the best to factorize. Factorizing the transformations of $A$ amounts to
fitting an evolved system in the future, while traditional MF methods factorize the
current static system.

Graph neural networks for industrial recommender systems are first studied
in (Ying et al, 2018b), whose backend model, PinSage, is a variant of GraphSage. PinSage
computes the L1-normalized visit counts of nodes during random walks started
from a given node $v$, and the top-k counted nodes are regarded as $v$’s receptive field.
Weighted aggregation is performed among the nodes according to their normalized
counts. As GraphSage-like approaches do not suffer from overly large neighborhoods,
PinSage is scalable to web-scale recommender systems with millions of users and
items. It adopts a triplet loss, instead of the NCE variants that are usually used in
other papers.
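
The random-walk-based receptive field described above can be sketched as follows. This is not the PinSage implementation: the number of walks, the walk length, and the choice to normalize counts over the retained top-k nodes are assumptions made for illustration, and the function names are hypothetical.

```python
import random
from collections import Counter


def importance_neighborhood(adj, start, k, n_walks=200, walk_len=3, seed=0):
    """Top-k visited nodes, weighted by L1-normalized visit counts."""
    rng = random.Random(seed)
    visits = Counter()
    for _ in range(n_walks):
        node = start
        for _ in range(walk_len):
            if not adj[node]:
                break
            node = rng.choice(adj[node])
            if node != start:
                visits[node] += 1
    top = visits.most_common(k)
    total = sum(count for _, count in top)
    return {node: count / total for node, count in top}


def weighted_aggregate(features, weights):
    """Weighted sum of neighbor features using the normalized visit counts."""
    dim = len(next(iter(features.values())))
    return [sum(w * features[n][d] for n, w in weights.items()) for d in range(dim)]


adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
features = {n: [float(n), 1.0] for n in adj}
weights = importance_neighborhood(adj, start=0, k=2)
aggregated = weighted_aggregate(features, weights)
```

Because the receptive field has at most k nodes regardless of node degree, the aggregation cost stays bounded on graphs with heavy-tailed degree distributions.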

We want to discuss further the choice of negative examples in representation
learning based recommender models, including GNNs, in the retrieval phase.
As the retrieval phase aims to retrieve the k most relevant items from the entire item
space, it is crucial to keep an item’s global position far from all irrelevant items.
In an industrial system with an extremely large candidate set, we find the performance
of any representation-based model very sensitive to the choice of negative
samples and the loss function. Though there seems to be a trend of mixing all kinds of
hand-crafted hard examples (Ying et al, 2018b; Huang et al, 2020b; Grbovic and
Cheng, 2018) into the binary cross entropy loss or triplet loss, unfortunately, there
is no theoretical support that can lead us in the right direction. In practice, we find
it a good choice to apply sampled softmax (Jean et al, 2014; Bengio and Senécal,
2008) or InfoNCE (Zhou et al, 2020a) in the retrieval phase with an extremely large
candidate set, where the latter also has a debiasing effect.
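
A sampled softmax loss can be sketched as below. The sketch assumes the commonly used correction that subtracts the log proposal probability from each logit before normalizing; whether the cited systems use exactly this form is not stated in the text, and the function name and arguments are hypothetical.

```python
import math


def sampled_softmax_loss(pos_logit, neg_logits, pos_logq, neg_logqs):
    """Cross entropy over the positive and sampled negatives, with logits
    corrected by the log probability under the proposal distribution."""
    corrected = [pos_logit - pos_logq]
    corrected += [logit - logq for logit, logq in zip(neg_logits, neg_logqs)]
    m = max(corrected)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(c - m) for c in corrected))
    return log_z - corrected[0]


# With a uniform proposal the correction cancels out of the softmax.
uniform = math.log(1.0 / 3.0)
loss = sampled_softmax_loss(2.0, [0.0, 0.0], uniform, [uniform, uniform])
better = sampled_softmax_loss(4.0, [0.0, 0.0], uniform, [uniform, uniform])
```

Only the sampled negatives enter the normalizer, so the cost per example depends on the number of samples rather than on the size of the full candidate set.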

GNNs are a useful tool for incorporating relational features of users and items.
KGCN (Wang et al, 2019e) enhances the item representation by performing aggregations
among its corresponding entity neighborhood in a knowledge graph.
KGNN-LS (Wang et al, 2019c) further poses a label smoothness assumption, which
posits that similar items in the knowledge graph are likely to have similar user
preference. It adds a regularization term to help learn such a personalized weighted
knowledge graph. KGAT (Wang et al, 2019j) shares a generally similar idea with
KGCN. The main difference is an auxiliary loss for knowledge graph reconstruction.

Although many more papers discuss how to fuse external knowledge and
relationships of other entities, all arguing that it is beneficial for downstream
recommendation tasks, one should seriously consider whether one’s system needs such
external knowledge or whether it will introduce more noise than benefit.


19.2 Case Study 1: Dynamic Graph Neural Networks Learning


19.2.1 Dynamic Sequential Graph


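In a recommender, we can obtain a list of user-item interaction tuples
$\mathcal{E} = \{(u,i,t)\}$ observed in a time window, where the user
$u \in \mathcal{U}$ interacts with an item $i \in \mathcal{I}$ at a timestamp
$t \in \mathbb{R}^{+}$. For a user $u \in \mathcal{U}$ (or an item $i \in \mathcal{I}$) at
time $t$, we define the 1-depth dynamic sequential subgraph of user $u$ (or item $i$) at
time $t$ as the set of interactions of user $u$ (or item $i$) before time $t$ in
chronological order, denoted by
$\mathcal{G}^{(1)}_{u,t} = \{ (u,i,\tau) \mid \tau < t, (u,i,\tau) \in \mathcal{E} \}$ (or
$\mathcal{G}^{(1)}_{i,t} = \{ (u,i,\tau) \mid \tau < t, (u,i,\tau) \in \mathcal{E} \}$).
Given the k-depth dynamic sequential subgraphs $\mathcal{G}^{(k)}_{i,\tau}$ for
$i \in \mathcal{I}$ (or $\mathcal{G}^{(k)}_{u,\tau}$ for $u \in \mathcal{U}$), we define
the (k+1)-depth dynamic sequential subgraph of user $u$ (or item $i$) at time $t$ as the
set of k-depth dynamic sequential subgraphs of the items (or users) that user $u$ (or
item $i$) interacts with in chronological order, together with its 1-depth dynamic
sequential subgraph,
$\mathcal{G}^{(k+1)}_{u,t} = \{ \mathcal{G}^{(k)}_{i,\tau} \mid \tau < t, (u,i,\tau) \in \mathcal{E} \} \cup \mathcal{G}^{(1)}_{u,t}$
(or
$\mathcal{G}^{(k+1)}_{i,t} = \{ \mathcal{G}^{(k)}_{u,\tau} \mid \tau < t, (u,i,\tau) \in \mathcal{E} \} \cup \mathcal{G}^{(1)}_{i,t}$).
The illustration of DSG is shown in Figure 19.1. We define the historical
behavior sequence of user $u$ (or item $i$) at time $t$ as the sequence of interacted
items (or users) in chronological order, denoted by
$\mathcal{S}_{u,t} = \{ (i,\tau) \mid \tau < t, (u,i,\tau) \in \mathcal{E} \}$ (or
$\mathcal{S}_{i,t} = \{ (u,\tau) \mid \tau < t, (u,i,\tau) \in \mathcal{E} \}$).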
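
The recursive definition can be made concrete with a short sketch that builds a DSG as a nested tree. The tree layout (dictionaries with "node", "time", and "children" keys) and the function names are hypothetical choices for illustration, not the data structure used by the authors.

```python
def history(interactions, node, t, side):
    """Interactions of a user (side='user') or an item (side='item') strictly
    before time t, in chronological order."""
    idx = 0 if side == "user" else 1
    past = [e for e in interactions if e[idx] == node and e[2] < t]
    return sorted(past, key=lambda e: e[2])


def build_dsg(interactions, node, t, side, depth):
    """Expand each past interaction into the counterpart's (depth-1) DSG."""
    tree = {"node": node, "time": t, "children": []}
    if depth == 0:
        return tree
    other_side = "item" if side == "user" else "user"
    for (u, i, tau) in history(interactions, node, t, side):
        other = i if side == "user" else u
        tree["children"].append(build_dsg(interactions, other, tau, other_side, depth - 1))
    return tree


# Interaction tuples (user, item, timestamp).
interactions = [
    ("u1", "i1", 1.0),
    ("u2", "i1", 2.0),
    ("u1", "i2", 3.0),
    ("u2", "i2", 4.0),
]
dsg = build_dsg(interactions, "u2", 5.0, "user", depth=2)
```

Each child is expanded only with interactions that happened before the child's own timestamp, so the tree never uses information from the future of an interaction.
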

(a) Dynamic sequential graphs in recommendation. (b) An example of a user’s 3-depth DSG.


Fig. 19.1: Illustration of Dynamic Sequential Graph. DSG is a heterogeneous
time-evolving dynamic graph combining the high-hop connectivity in graphs and the
temporal dependency in sequences. DSG is constructed from bottom to top recursively.


19.2.2 DSGL: Dynamic Sequential Graph Learning


19.2.2.1 Overview 

Fig. 19.2: Framework of the proposed DSGL method. DSGL constructs DSGs for
the target user u (left) and the candidate item i (right) respectively. Their representa- 
tions are refined with multiple aggregation layers, each of which consists of a time- 
aware sequence encoding layer and a second-order graph attention layer. DSGL gets 
the final representations via layer combination followed by an MLP-based predic- 
tion layer. Modules of the same color share the same set of parameters. 


Based on the constructed user-item interaction DSG, we propose the edge learn- 
ing model named Dynamic Sequential Graph Learning (DSGL), as illustrated in 
Figure 19.2. The basic idea of DSGL is to perform graph convolution iteratively on 
the DSGs for the target user and the candidate item on their corresponding devices, 
by aggregating the embeddings of neighbors as the new representation of a target 
node. The aggregator consists of two parts: (1) the time-aware sequence encoding 
that encodes the behavior sequence with time information and temporal dependency 
captured; and (2) the second-order graph attention that activates the related behavior 
in the sequence to eliminate noisy information. Besides the above two components, 
we also propose an embedding layer that initializes user, item, and time embed- 
dings, a layer combination module that combines the embeddings of multiple layers 
to achieve final representations, and a prediction layer that outputs the prediction 
score. 


19.2.2.2 Embedding Layer


There are four groups of inputs in the proposed DSGL: the target user $u$, the
candidate item $i$, the $k$-depth DSGs of the target user $\mathcal{G}^{k}_{u,t}$, and the
$(k-1)$-depth DSGs of the candidate item $\mathcal{G}^{k-1}_{i,t}$. For each field of discrete
features, such as age, gender, category, brand, and ID, we represent it as an embedding
matrix. By concatenating all fields of features, we have the node feature of items,
denoted by $f_{item} \in \mathbb{R}^{d_i}$. Similarly, $f_{user} \in \mathbb{R}^{d_u}$ represents the
concatenated embedding vectors of the fields in the user category. As for the interaction
timestamps in the DSG, we compute the time interval between each interaction time and
its parent interaction time as the time decay. Given a historical behavior sequence
$\mathcal{S}_{u,t}$ of user $u$ at timestamp $t$, each interaction $(u, i, \tau) \in \mathcal{S}_{u,t}$
corresponds to a time decay $\Delta_{(u,i,\tau)} = t - \tau$. Following (Li et al, 2020g), we
transform the continuous time decay values into discrete features by mapping them to a
series of buckets with the ranges $[b^0, b^1), [b^1, b^2), \ldots, [b^l, b^{l+1})$, where the
base $b$ is a hyper-parameter. Then, by performing the embedding lookup operation, the
time decay embedding is obtained, denoted by $f_{time} \in \mathbb{R}^{d_t}$.
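
The bucketing of time decays can be illustrated with a short sketch. It assumes that decays smaller than $b^0 = 1$ fall into the first bucket and that very large decays are clipped to the last bucket; neither boundary rule is stated in the text, and the function names and the `num_buckets` parameter are hypothetical.

```python
def time_decay_bucket(t, tau, base=2.0, num_buckets=16):
    """Map the time decay t - tau to the index l of the bucket [base**l, base**(l+1)).

    Assumptions (not from the text): decays below 1 go to bucket 0, and
    decays beyond the last bucket are clipped to index num_buckets - 1.
    """
    if base <= 1:
        raise ValueError("base must be greater than 1")
    delta = t - tau
    if delta < 0:
        raise ValueError("the interaction time must not be later than t")
    idx = 0
    upper = base  # upper bound of the current bucket, i.e. base**(idx + 1)
    while delta >= upper and idx < num_buckets - 1:
        upper *= base
        idx += 1
    return idx


def lookup_time_embedding(table, t, tau, base=2.0):
    """Embedding lookup: one row of `table` per bucket."""
    return table[time_decay_bucket(t, tau, base=base, num_buckets=len(table))]


# Example: with base 2, a decay of 5 lies in [4, 8), i.e. bucket 2.
table = [[0.0, 0.1], [0.2, 0.3], [0.4, 0.5], [0.6, 0.7]]
f_time = lookup_time_embedding(table, t=105, tau=100)
```

Exponentially growing buckets give fine resolution to recent interactions and coarse resolution to old ones, which keeps the embedding table small even when decays span several orders of magnitude.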


19.2.2.3 Time-Aware Sequence Encoding 


The nodes at each layer of the DSGs are in time order, which reflects the time-varying
preference of users as well as the popularity evolution of items. Thus we perform
sequence modeling as a part of the GNN to capture the dynamics of the interaction
sequences. We design a time-aware sequential encoder to utilize the time information
explicitly. For each interaction $(u, i, t)$, we have the historical behavior sequence
$\mathcal{S}_{u,t}$ of user $u$ and $\mathcal{S}_{i,t}$ of item $i$. For sequence $\mathcal{S}_{u,t}$, by
feeding each interacted item along with its time decay into the embedding layer, the
behavior embedding sequence is formed from the combined features as
$\{e_{i,\tau} \mid (i, \tau) \in \mathcal{S}_{u,t}\}$, where
$e_{i,\tau} = [f_{item_i}; f_{time_\tau}] \in \mathbb{R}^{d_i + d_t}$ is the embedding of item $i$ in the
sequence. Similarly, for sequence $\mathcal{S}_{i,t}$, we have the embedding sequence
$\{e_{u,\tau} \mid (u, \tau) \in \mathcal{S}_{i,t}\}$, where
$e_{u,\tau} = [f_{user_u}; f_{time_\tau}] \in \mathbb{R}^{d_u + d_t}$. We take the obtained embeddings as the
zeroth-layer inputs of the time-aware sequence encoder, i.e., $x^{(0)}_{i,\tau} = e_{i,\tau}$ and
$x^{(0)}_{u,\tau} = e_{u,\tau}$. For ease of notation, we drop the superscript in the following two
subsections.

In the time-aware sequence encoding, we infer the hidden state of each node in
the behavior sequence step by step in an RNN-based manner. Given the behavior
sequences $\mathcal{S}_{u,t}$ and $\mathcal{S}_{i,t}$, we denote the hidden state and input of the
$j$-th item in $\mathcal{S}_{u,t}$ by $h_{item_j}$ and $x_{item_j}$, and the hidden state and input of the
$j$-th user in $\mathcal{S}_{i,t}$ by $h_{user_j}$ and $x_{user_j}$. The forward formulas are


$h_{item_j} = \mathcal{H}_{item}(h_{item_{j-1}}, x_{item_j}), \quad h_{user_j} = \mathcal{H}_{user}(h_{user_{j-1}}, x_{user_j}),$ (19.4)


where $\mathcal{H}_{user}(\cdot,\cdot)$ and $\mathcal{H}_{item}(\cdot,\cdot)$ represent the encoding functions specific to
user and item, respectively. We adopt the long short-term memory (LSTM) (Hochreiter and
Schmidhuber, 1997) as the encoder instead of the Transformer (Vaswani et al, 2017),
since the LSTM can take the time decay feature as input and use it to control the
information to be propagated. After the time-aware sequence encoding, we obtain the
corresponding hidden state sequences of the historical behavior sequence $\mathcal{S}_{u,t}$ of
user $u$ and $\mathcal{S}_{i,t}$ of item $i$. The time-aware sequence encoding functions can be
represented as:


$\mathrm{LSTM}_{item}(\{x_{i,\tau} \mid (i,\tau) \in \mathcal{S}_{u,t}\}) = \{h_{i,\tau} \mid (i,\tau) \in \mathcal{S}_{u,t}\},$

$\mathrm{LSTM}_{user}(\{x_{u,\tau} \mid (u,\tau) \in \mathcal{S}_{i,t}\}) = \{h_{u,\tau} \mid (u,\tau) \in \mathcal{S}_{i,t}\}.$ (19.5)


19.2.2.4 Second-Order Graph Attention 


In practice, there may exist noisy neighbors, whose interest or audience is irrele- 
vant to the target node. To eliminate the noise brought by the unreliable nodes, we 
propose an attention mechanism to activate related nodes in the behavior sequence. 
Traditional graph attention mechanisms, like GAT (Veličković et al, 2018), compute
attention weights between the central node and the neighbor nodes, which indicate
the importance of each neighbor node to the central node. Although they perform
well on the node classification task, they may increase noise diffusion for
recommendation when there exists an unreliable connection.

To address the above problem, we propose a graph attention mechanism that uses 
both the parent node of the central node and the central node itself to build the query 
and takes the neighbor nodes as the key and value. Since we use the parent node of 
the central node to enhance the expressive power of the query, which is connected 
to the key node with two hops, we name it second-order graph attention. The parent 
node of the central node can be seen as a complement when the central node is 
unreliable, thus improving the robustness. 

Following the scaled dot-product attention (Vaswani et al, 2017), the attention
function is defined as


$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d}}\right) V,$ (19.6)


where $Q$, $K$ and $V$ represent the query, key and value, respectively, and $d$ is the
dimension of $K$ and $Q$. The multi-head attention is defined as follows:


$\mathrm{MultiHead}(Q, K, V) = [\mathrm{head}_1; \mathrm{head}_2; \ldots; \mathrm{head}_h] W_O,$ (19.7)


$\mathrm{head}_i = \mathrm{Attention}(Q W_{Q_i}, K W_{K_i}, V W_{V_i}),$ (19.8)


where the weights $W_Q$, $W_K$, $W_V$ and $W_O$ are trained parameters.
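
The second-order attention described above can be sketched for a single head without learned projections. The text does not say how the parent node and the central node are combined into the query, so this sketch assumes an element-wise sum; the function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def second_order_attention(parent, central, neighbors):
    """Single-head scaled dot-product attention over neighbor hidden states.

    Assumption: the query is the element-wise sum of the parent and central
    node vectors. The neighbors serve as both keys and values.
    """
    d = len(central)
    query = [p + c for p, c in zip(parent, central)]
    scores = [dot(query, h) / math.sqrt(d) for h in neighbors]
    weights = softmax(scores)
    output = [sum(w * h[k] for w, h in zip(weights, neighbors)) for k in range(d)]
    return output, weights


# The first neighbor aligns with the query and receives the larger weight.
out, w = second_order_attention(parent=[1.0, 0.0], central=[1.0, 0.0],
                                neighbors=[[1.0, 0.0], [0.0, 1.0]])
```

Because the parent contributes to the query, a neighbor that matches the parent can still be activated when the central node itself is noisy, which is the robustness argument made in the text.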

Given the behavior hidden state sequences $\{h_{i,\tau} \mid (i,\tau) \in \mathcal{S}_{u,t}\}$ and
$\{h_{u,\tau} \mid (u,\tau) \in \mathcal{S}_{i,t}\}$ after the time-aware sequence encoding, we represent
the attention process as:


$x_{u,t} = \mathrm{ATT}_{item}(\{h_{i,\tau} \mid (i,\tau) \in \mathcal{S}_{u,t}\}), \quad x_{i,t} = \mathrm{ATT}_{user}(\{h_{u,\tau} \mid (u,\tau) \in \mathcal{S}_{i,t}\}).$ (19.9)


19.2.2.5 Aggregation and Layer Combination 


The core idea of GCN is to learn representations for nodes by performing convolution
over their neighborhoods. In DSGL, we stack the time-aware sequence encoding and
the second-order graph attention, and the aggregator can be represented as:


$x^{(k+1)}_{u,t} = \mathrm{ATT}_{item}(\mathrm{LSTM}_{item}(\{x^{(k)}_{i,\tau} \mid (i,\tau) \in \mathcal{S}_{u,t}\})),$

$x^{(k+1)}_{i,t} = \mathrm{ATT}_{user}(\mathrm{LSTM}_{user}(\{x^{(k)}_{u,\tau} \mid (u,\tau) \in \mathcal{S}_{i,t}\})).$ (19.10)


Different from traditional GCN models that use the last layer as the final node
representation, inspired by (He et al, 2020), we combine the embeddings obtained at
each layer to form the final representation of a user (an item):


$\hat{x}_{u,t} = \frac{1}{K_u} \sum_{k=1}^{K_u} x^{(k)}_{u,t}, \quad \hat{x}_{i,t} = \frac{1}{K_i} \sum_{k=1}^{K_i} x^{(k)}_{i,t},$ (19.11)


where $K_u$ and $K_i$ denote the numbers of DSGL layers for user $u$ and item $i$,
respectively.


19.2.3 Model Prediction 


Given an interaction triplet $(u, i, t)$, we predict the probability of the user
interacting with the item as:


$\hat{y} = \mathcal{F}(u, i, \mathcal{G}^{k}_{u,t}, \mathcal{G}^{k-1}_{i,t}; \Theta) = \mathrm{MLP}([e_{u,t}; e_{i,t}; \hat{x}_{u,t}; \hat{x}_{i,t}]),$ (19.12)


where $\mathrm{MLP}(\cdot)$ represents the MLP layer and $\Theta$ denotes the network parameters. We
adopt the cross-entropy loss function:


$L = -\sum_{(u,i,t,y) \in \mathcal{D}} [y \log \hat{y} + (1 - y) \log(1 - \hat{y})],$ (19.13)


where $\mathcal{D}$ is the set of training samples, and $y \in \{0, 1\}$ denotes the real label. The
algorithm procedure is presented in Algorithm 2.
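
As a small illustration of the last two steps, the sketch below combines per-layer embeddings and evaluates the cross-entropy loss of Eq. 19.13. It assumes the layer combination is a uniform average over layers, in the spirit of (He et al, 2020), and it clips predictions away from 0 and 1 for numerical safety; both choices and all function names are assumptions rather than details confirmed by the text.

```python
import math


def combine_layers(layer_outputs):
    """Average the per-layer embeddings of one node (uniform weights assumed)."""
    n = len(layer_outputs)
    dim = len(layer_outputs[0])
    return [sum(layer[k] for layer in layer_outputs) / n for k in range(dim)]


def cross_entropy(samples, eps=1e-12):
    """Sum of binary cross-entropy terms over (label, predicted probability) pairs."""
    total = 0.0
    for y, y_hat in samples:
        p = min(max(y_hat, eps), 1.0 - eps)  # keep log() finite
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total


x_hat = combine_layers([[1.0, 2.0], [3.0, 4.0]])   # two layers, dimension 2
loss = cross_entropy([(1, 0.5), (0, 0.5)])          # two uninformative predictions
```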
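Algorithm 2 The algorithm of DSGL.


Input:
    The training set $\mathcal{D} = \{(u, i, t, y)\}$; User set $\mathcal{U}$; Item set $\mathcal{I}$;
    Interaction set $\mathcal{E}$; Depths $k_u$, $k_i$; Number of epochs $E$.
Output: Network parameters $\Theta$.
Initialize input feature $f_{user_u}$ of user $u \in \mathcal{U}$ and $f_{item_i}$ of item $i \in \mathcal{I}$;
for $e \leftarrow 1$ to $E$ do
    for $(u, i, t, y) \in \mathcal{D}$ do
        Construct DSGs $\mathcal{G}^{k_u}_{u,t}$, $\mathcal{G}^{k_i}_{i,t}$ for user $u$ and item $i$ from $\mathcal{E}$;
        for $(v, j, \tau) \in \mathcal{G}^{k_u}_{u,t} \cup \mathcal{G}^{k_i}_{i,t}$ do
            Obtain the behavior sequences $\mathcal{S}_{v,\tau}$ and $\mathcal{S}_{j,\tau}$;
            $x^{(0)}_{v,\tau} \leftarrow e_{v,\tau}$, $x^{(0)}_{j,\tau} \leftarrow e_{j,\tau}$;
            for $k \leftarrow 1$ to $k_u$ do
                $x^{(k)}_{v,\tau} \leftarrow \mathrm{ATT}_{item}(\mathrm{LSTM}_{item}(\{x^{(k-1)}_{i} \mid i \in \mathcal{S}_{v,\tau}\}))$;
            end for
            for $k \leftarrow 1$ to $k_i$ do
                $x^{(k)}_{j,\tau} \leftarrow \mathrm{ATT}_{user}(\mathrm{LSTM}_{user}(\{x^{(k-1)}_{u} \mid u \in \mathcal{S}_{j,\tau}\}))$;
            end for
        end for
        $\hat{x}_{u,t} \leftarrow \frac{1}{k_u} \sum_{k=1}^{k_u} x^{(k)}_{u,t}$, $\hat{x}_{i,t} \leftarrow \frac{1}{k_i} \sum_{k=1}^{k_i} x^{(k)}_{i,t}$;
        $\hat{y}_{u,i,t} \leftarrow \mathrm{MLP}([e_{u,t}; e_{i,t}; \hat{x}_{u,t}; \hat{x}_{i,t}])$;
        Update the parameters $\Theta$ by optimizing Eq. 19.13;
    end for
end for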


19.2.4 Experiments and Discussions 


We evaluate our methods on the real-world Amazon product datasets
(http://snap.stanford.edu/data/amazon/productGraph/), and use five subsets. The
widely used metrics for the CTR prediction task, i.e., AUC (the area under the ROC
curve) and Logloss, are adopted. The compared recommendation methods can be grouped
into five categories: conventional methods (SVD++ (Koren, 2008) and PNN (Qu et al,
2016)), sequential methods with user behaviors (GRU4Rec (Hidasi et al, 2015), CASER
(Tang and Wang, 2018), ATRANK (Zhou et al, 2018a) and DIN (Zhou et al, 2018b)),
sequential methods with user and item behaviors (Topo-LSTM (Wang et al, 2017b), TIEN
(Li et al, 2020g) and DIB (Guo et al, 2019a)), static-graph-based methods (NGCF (Wang
et al, 2019k) and LightGCN (He et al, 2020)), and a dynamic-graph-based method
(SR-GNN (Wu et al, 2019c)).


19.2.4.1 Performance Comparison


To demonstrate the performance of the proposed model, we compare DSGL with
the state-of-the-art recommendation methods. We find that DSGL consistently
outperforms all other baselines, demonstrating its effectiveness. The sequential models
outperform the conventional methods by a large margin, which shows the benefit of
capturing temporal dependency in recommendation. The sequential methods that
model both user behaviors and item behaviors outperform the methods that only use
the user behavior sequences, which verifies the importance of both user- and
item-side behavior information. The performance of the static-graph-based methods,
including LightGCN and NGCF, is not competitive. The reasons are twofold. First,
these methods ignore the new interactions in the testing set in the inference phase.
Second, since they do not model the temporal dependency of interactions, they
cannot capture evolving interests, which degrades their performance compared with
sequential models. The session-graph-based method SR-GNN outperforms the
static-graph-based methods, because SR-GNN incorporates all the interacted items before
the current moment into graphs dynamically. However, it underperforms the
sequential methods. One possible reason could be that the ratio of repeated items in
the sequences is low in the Amazon datasets, and the transitions of items are not
complex enough to be modeled as graphs.


19.2.4.2 Effectiveness of Graph Structure and Layer Combination 


To show the effectiveness of the graph structure and layer combination, we compare
the performance of DSGL and its variant DSGL w/o LC, which uses the last layer
instead of the combined layers as the final representation, w.r.t. different numbers
of layers. Focusing on DSGL with layer combination, the performance gradually
improves as the number of layers increases. We attribute the improvement to the
collaborative information carried by the second-order and third-order connectivity in
the graph structure. Comparing DSGL and DSGL w/o LC, we find that removing the
layer combination substantially degrades the performance, which demonstrates the
effectiveness of layer combination.


19.2.4.3 Effectiveness of Time-Aware Sequence Encoding 


In DSGL, we perform time-aware sequence encoding to preserve both the order of
behaviors and the time information. Thus, we design ablation experiments to study
how the temporal dependency and time information in DSGL contribute to the
final performance. To evaluate the role of time information, we test the removal of
the time feature only of the item behavior (i.e., DSGL w/o time in UBH), only of the
user behavior (i.e., DSGL w/o time in IBH), and of both behaviors (i.e., DSGL w/o
time). To evaluate the contribution of the behavior order, we test the removal of
the sequence encoding module while retaining time information (i.e., DSGL w/o
Seq ENC) and the removal of the time-aware sequence encoding (i.e., DSGL w/o
TA Seq ENC). From the comparison, we find that DSGL outperforms DSGL w/o
TA Seq ENC by a significant margin, demonstrating the efficacy of the time-aware
sequence encoding layer. Comparing DSGL w/o time, DSGL w/o time in UBH and
DSGL w/o time in IBH with the default DSGL, we observe that removing the time
information on either the user or the item behavior side causes performance
degradation. DSGL outperforms DSGL w/o Seq ENC, confirming the importance of the
temporal dependency carried by the historical behavior sequence.


19.2.4.4 Effectiveness of Second-Order Graph Attention 


In DSGL, we propose a second-order graph attention to eliminate noise from
unreliable neighbors. To justify this design, we explore different choices here. We test
the performance without graph attention (i.e., DSGL w/o ATT). We also replace the
second-order graph attention with the traditional graph attention (i.e., DSGL-GAT).
Note that the attention function in DSGL-GAT is the same as the one in DSGL;
the only difference is the query, since DSGL-GAT takes only the central node as the
query. From the results, we have the following observations:


• The best setting in all cases is adopting the second-order graph attention (i.e., the
  current design of DSGL). Replacing it with GAT drops the performance,
  demonstrating the effectiveness of second-order attention in activating related
  neighbors and eliminating the noise from unreliable neighbors.

• Removing the attention mechanism (i.e., DSGL w/o ATT) degrades the performance
  substantially, to below that of DSGL with traditional graph attention. In some
  cases, the performance is not even as good as the best baseline. This observation
  demonstrates the necessity of introducing the attention mechanism in GNN-based
  recommendation methods due to the inevitable noise in the multi-hop neighborhood.


19.3 Case Study 2: Device-Cloud Collaborative Learning for 
Graph Neural Networks 


19.3.1 The proposed framework 


Recently, several works (Sun et al, 2020e; Cai et al, 2020a; Gong et al, 2020; Yang
et al, 2019e; Lin et al, 2020e; Niu et al, 2020) have explored the advantages of
on-device computing in recommender systems. This drives the development of on-device
GNNs, e.g., DSGL in the previous section. However, these early works consider only
the cloud modeling, only on-device inference, or only the aggregation of the temporal
on-device training pieces to handle the privacy constraint. Little work has explored
the device modeling and the cloud modeling jointly to benefit both sides for GNNs.
To bridge this gap, we introduce a Device-Cloud Collaborative Learning (DCCL)
framework, as shown in Figure 19.3. Given a recommendation dataset
$\{(x_n, y_n)\}_{n=1,\ldots,N}$, we aim to learn a GNN-based mapping function $f: x_n \to y_n$ on
the cloud side. Here, $x_n$ is the graph feature that contains all available candidate
features and user context, $y_n$ is the user implicit feedback (click or not) on the
corresponding candidate, and $N$ is the sample number. On the device side, each device
(indexed by $m$) has its own local dataset $\{(x^{(m)}_n, y^{(m)}_n)\}_{n=1,\ldots,N^{(m)}}$. We add a
few parameter-efficient patches (Yuan et al, 2020a) to the cloud GNN model $f$
(freezing its parameters on the device side) for each device to build a new GNN
$f^{(m)}: x^{(m)}_n \to y^{(m)}_n$. In the following, we present the practical challenges in
deployment and our solutions.


Fig. 19.3: The general DCCL framework for recommendation. The cloud side is
responsible for learning the centralized cloud GNN model via the model-over-models
distillation from the personalized on-device GNN models. The device receives the
cloud GNN model to conduct the on-device personalization. We propose MoMoDistill
and MetaPatch to instantiate each side respectively.


19.3.1.1 MetaPatch for On-device Personalization 


Although device hardware has improved greatly in recent years, devices are still too
resource-constrained to learn a complete big model. Meanwhile, finetuning only the
last few layers is performance-limited due to the feature basis of the pretrained
layers. Fortunately, some previous works have demonstrated that it is possible to
achieve performance comparable to finetuning the whole network via patch learning
(Cai et al, 2020b; Yuan et al, 2020a; Houlsby et al, 2019). Inspired by these works,
we insert model patches on the basis of the cloud model $f$ for on-device
personalization. Formally, the output of the $l$-th layer attached with one patch on
the $m$-th device is expressed as


$f^{(m)}_l(\cdot) = f_l(\cdot) + h^{(m)}_l(\cdot) \circ f_l(\cdot),$ (19.14)


where the right-hand side of Eq. 19.14 is the sum of the original $f_l(\cdot)$ and the patch
response to $f_l(\cdot)$. Here, $h^{(m)}_l(\cdot)$ is the trainable patch function and $\circ$ denotes
function composition, which treats the output of the previous function as the input.
Note that the model patch could have different neural architectures. Here, we do not
explore its variants but use the same bottleneck architecture as (Houlsby et al, 2019).
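
A minimal sketch of Eq. 19.14 follows. It models the patch as a down-projection, a nonlinearity, and an up-projection, which is the general shape of a bottleneck adapter; the ReLU nonlinearity, the absence of biases, and all function names are assumptions made for illustration only.

```python
def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]


def relu(vec):
    return [max(0.0, v) for v in vec]


def bottleneck_patch(w_down, w_up):
    """Return h(z) = W_up relu(W_down z), a small trainable patch function."""
    def h(z):
        return matvec(w_up, relu(matvec(w_down, z)))
    return h


def patched_layer(f_l, h_l):
    """Build f_l^(m)(x) = f_l(x) + h_l(f_l(x)), following Eq. 19.14."""
    def layer(x):
        z = f_l(x)  # output of the frozen cloud layer
        return [a + b for a, b in zip(z, h_l(z))]
    return layer


frozen = lambda x: [2.0 * v for v in x]            # stand-in for a cloud layer
patch = bottleneck_patch(w_down=[[1.0, 1.0]],      # 2 -> 1
                         w_up=[[0.5], [-0.5]])     # 1 -> 2
personalized = patched_layer(frozen, patch)
y = personalized([1.0, 2.0])
```

With the up-projection initialized to zero the patched layer reproduces the cloud layer exactly, so personalization can start from the cloud model's behavior and move away from it only as local data arrives.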
Nevertheless, we empirically find that the parameter space of multiple patches is
still relatively large and easily overfits the sparse local samples. To overcome
this issue, we propose MetaPatch to reduce the parameter space. It is a kind of
meta-learning method that generates parameters (Ha et al, 2017; Jia et al, 2016).
Concretely, assume the parameters of each patch are denoted by $\theta^{(m)}_l$ (flattening
all parameters in the patch into a vector). Then, we can deduce the following
decomposition


$\theta^{(m)}_l = \Theta_l * \hat{\theta}^{(m)},$ (19.15)


where $\Theta_l$ is the globally shared parameter basis (frozen on the device and learned
in the cloud) and $\hat{\theta}^{(m)}$ is the surrogate tunable parameter vector used to generate
each patch parameter $\theta^{(m)}_l$ in the device GNN model $f^{(m)}$. To facilitate
understanding, we term $\hat{\theta}^{(m)}$ the metapatch parameter. Here, we keep the number of
metapatch parameters far smaller than the number of patch parameters, so that only a
small vector has to be learned for personalization. Note that we defer the discussion
of the pretraining of $\Theta_l$ to the following section, since it is learned on the cloud
side. According to Eq. 19.15, we implement the patch parameter generation via the
metapatch parameter $\hat{\theta}^{(m)}$ instead of directly learning $\theta^{(m)}_l$. To learn the
metapatch parameter, we can leverage the local dataset to minimize the following loss
function.


$\min_{\hat{\theta}^{(m)}} \ell(y, \tilde{y}) \big|_{\tilde{y} = f^{(m)}(x)},$ (19.16)


where $\ell$ is the pointwise cross-entropy loss,
$f^{(m)}(\cdot) = f^{(m)}_L(\cdot) \circ \cdots \circ f^{(m)}_l(\cdot) \circ \cdots \circ f^{(m)}_1(\cdot)$,
and $L$ is the total number of layers. After training the device-specific parameter
$\hat{\theta}^{(m)}$ by Eq. 19.16, we can use Eq. 19.15 to generate all patches, and then insert
them into the cloud GNN model $f$ via Eq. 19.14 to get the final personalized GNN model
$f^{(m)}$, which provides the on-device personalized recommendation.
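
The parameter generation of Eq. 19.15 can be sketched as follows, reading the product as a matrix-vector product between a frozen basis and a short metapatch vector. That reading, the shapes, and the helper names are assumptions for illustration.

```python
def generate_patch_params(basis, metapatch):
    """theta_l = Theta_l * theta_hat, read here as a matrix-vector product.

    `basis` has one row per patch parameter and one column per metapatch
    parameter, so a short vector expands into a long flat parameter vector.
    """
    return [sum(b * t for b, t in zip(row, metapatch)) for row in basis]


def unflatten(flat, rows, cols):
    """Reshape a flat parameter vector into a rows x cols weight matrix."""
    assert len(flat) == rows * cols
    return [flat[r * cols:(r + 1) * cols] for r in range(rows)]


# A frozen 6 x 2 basis turns 2 tunable numbers into the 6 weights of a 2 x 3 patch.
basis = [[1, 0], [0, 1], [1, 1], [2, 0], [0, 2], [1, -1]]
theta_hat = [1.0, -1.0]
patch_weights = unflatten(generate_patch_params(basis, theta_hat), rows=2, cols=3)
```

Only `theta_hat` is trained on the device in this sketch, which is how the decomposition shrinks the tunable parameter space relative to learning the patch weights directly.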


19.3.1.2 MoMoDistill to Enhance the Cloud Modeling 


The conventional incremental training of the centralized cloud model follows the
“model-over-data” paradigm. That is, when new training samples are collected
from devices, we directly perform incremental learning based on the model
trained in the early sample collection. The objective is formulated as follows,


$\min_{W_f} \ell(y, \hat{y}) \big|_{\hat{y} = f(x)},$ (19.17)


where $W_f$ is the network parameter of the cloud GNN model $f$ to be trained. This is
an independent perspective that does not consider the device modeling. However, the
on-device personalization can actually be more powerful than the centralized cloud
model in handling the corresponding local samples. Thus, the guidance from the
on-device models could be a meaningful prior to help the cloud modeling. Inspired
by this, we propose a “model-over-models” paradigm to simultaneously learn from
data and aggregate the knowledge from on-device models, to enhance the training of
the centralized cloud model. Formally, the objective with the distillation procedure
on the samples from all devices is defined as


$\min_{W_f} \ell(y, \hat{y}) + \beta\, \mathrm{KL}(\tilde{y}, \hat{y}) \big|_{\hat{y} = f(x),\, \tilde{y} = f^{(m)}(x)},$ (19.18)


where $\beta$ is the hyperparameter that balances the distillation and the
“model-over-data” learning. Note that the feasibility of the distillation in Eq. 19.18
critically depends on the patch mechanism in the previous section, since it allows us
to feed the metapatch parameters as features while loading the other parameters of
$f^{(m)}$ only once. Otherwise, we would suffer from the engineering issue of reloading
numerous checkpoints frequently, which is almost impossible for current frameworks.
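
For a single sample, the objective in Eq. 19.18 can be sketched as below. The sketch assumes that both models output a click probability, so the KL term is taken between two Bernoulli distributions with the on-device prediction as the teacher; the clipping constant and the function names are likewise assumptions.

```python
import math


def _clip(p, eps=1e-12):
    """Keep probabilities strictly inside (0, 1) so that log() stays finite."""
    return min(max(p, eps), 1.0 - eps)


def bce(y, y_hat):
    """Pointwise cross-entropy between a binary label and a probability."""
    p = _clip(y_hat)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def bernoulli_kl(teacher, student):
    """KL(teacher || student) for two Bernoulli click probabilities."""
    t, s = _clip(teacher), _clip(student)
    return t * math.log(t / s) + (1.0 - t) * math.log((1.0 - t) / (1.0 - s))


def momo_distill_loss(y, cloud_pred, device_pred, beta=1.0):
    """Data term plus beta times the distillation term, for one sample."""
    return bce(y, cloud_pred) + beta * bernoulli_kl(device_pred, cloud_pred)


# The cloud model is pulled toward both the label and the on-device prediction.
loss = momo_distill_loss(y=1, cloud_pred=0.6, device_pred=0.9, beta=0.5)
```

Setting `beta` to zero recovers the plain “model-over-data” objective of Eq. 19.17.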

In MetaPatch, we introduce the global parameter basis $\{\Theta_l\}$ (simplified as $\Theta$) to
reduce the parameter space on the device. Regarding its training, we empirically find
that learning it jointly with $W_f$ easily falls into undesirable local optima, since
they play different roles in terms of their semantics. Therefore, we resort to a
progressive optimization strategy: first optimize $f$ based on Eq. 19.18, and then
distill the knowledge for the parameter basis $\Theta$ with the learned $f$. For the second
step, we design an auxiliary component that accounts for the heterogeneous
characteristics of the metapatches from all devices and the cold-start issue at the
beginning. Concretely, let $m$ be the device index and let $u \subset x$ be the user profile
features (e.g., age, gender, purchase level, etc.) of the corresponding device; we
define the following auxiliary encoder,


$U(\hat{\theta}, u) = W^{(1)} \tanh(W^{(2)} \hat{\theta} + W^{(3)} u),$ (19.19)


where $W^{(1)}$, $W^{(2)}$ and $W^{(3)}$ are tunable projection matrices. Here, we use $W_e$ to
denote the collection $\{W^{(1)}, W^{(2)}, W^{(3)}\}$ for simplicity. To learn the global
parameter basis, we replace $\hat{\theta}$ by $U(\hat{\theta}, u)$ in Eq. 19.15 to generate the model
patch, i.e., $\Theta * U(\hat{\theta}, u)$, since $\hat{\theta}$ itself is too heterogeneous to be used
directly. Then, combining $\Theta * U(\hat{\theta}, u)$ with $f$ learned in the first distillation
step, we can form a new proxy device model $\hat{f}^{(m)}$ (different from $f^{(m)}$ in the patch
generation). Here, we leverage such a proxy $\hat{f}^{(m)}$ to directly distill the knowledge
from the true $f^{(m)}$ collected from devices, which optimizes $\Theta$ and the parameters of
the auxiliary encoder,


$\min_{(\Theta, W_e)} \ell(y, \hat{y}) + \beta\, \mathrm{KL}(\tilde{y}, \hat{y}) \big|_{\tilde{y} = f^{(m)}(x),\, \hat{y} = \hat{f}^{(m)}(x)}.$ (19.20)


Eq. 19.18 and Eq. 19.20 progressively help learn the centralized cloud model and the
global parameter basis. We term this progressive distillation mechanism
MoMoDistill to emphasize that our “model-over-models” paradigm differs from the
conventional “model-over-data” incremental training on the cloud side. Finally, in
Algorithm 3, we summarize the complete procedure of DCCL.
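
The auxiliary encoder of Eq. 19.19 is small enough to write out directly. The sketch below uses plain lists for the three projection matrices; the shapes in the example and the function names are hypothetical.

```python
import math


def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]


def auxiliary_encoder(theta_hat, u, w1, w2, w3):
    """U(theta_hat, u) = W1 tanh(W2 theta_hat + W3 u), following Eq. 19.19."""
    hidden = [math.tanh(a + b)
              for a, b in zip(matvec(w2, theta_hat), matvec(w3, u))]
    return matvec(w1, hidden)


# Cold start: with theta_hat = 0 the output depends on the user features only.
w1 = [[1.0, 0.0], [0.0, 1.0]]
w2 = [[1.0, 0.0], [0.0, 1.0]]
w3 = [[1.0], [-1.0]]
encoded = auxiliary_encoder(theta_hat=[0.0, 0.0], u=[1.0], w1=w1, w2=w2, w3=w3)
```

Because the user features enter through their own projection, a new device whose metapatch parameter is still zero can already receive a patch that reflects its user profile, which is one way to read the cold-start motivation given in the text.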
Algorithm 3 Device-Cloud Collaborative Learning for GNNs


Pretrain the cloud GNN model $f$, and then learn the global parameter basis $\Theta$ based
on Eq. 19.20 by setting $\hat{\theta}$ as 0.

while lifecycle do
    Send $f$ and $\Theta$ to devices.

    Device($f$, $\Theta$):    ▷ MetaPatch
        1) Accumulate the local data into batches
        2) On-device personalization via Eq. 19.16
        3) If time > threshold: upload the personalized GNN model $f^{(m)}$
        4) Else: return to step 1)

    Recycle all model patches $\{\hat{\theta}^{(m)}\}$.

    Cloud($\{\hat{\theta}^{(m)}\}$):    ▷ MoMoDistill
        1) Optimize the cloud GNN model $f$ based on Eq. 19.18
        2) Learn the parameter basis $\Theta$ by Eq. 19.20


19.3.2 Experiments and Discussions 


To demonstrate the effectiveness of the proposed framework, we conduct a range
of experiments on three recommendation datasets: Amazon, Movielens-1M and
Taobao. All three datasets consist of user interaction histories in sequence
format, and the last item each user interacted with is held out as the test sample.
For each held-out item, we randomly sample 100 items that do not appear in the user
history. We compare our framework with some classical cloud models, namely, the
conventional methods MF (Koren et al, 2009) and FM (Rendle, 2010), the deep
learning-based methods NeuMF (He et al, 2017b) and DeepFM (Guo et al, 2017), and the
sequence-based methods SASRec (Kang and McAuley, 2018) and DIN (Zhou et al, 2018b).
For all experiments, we implement our model on the basis of DIN, where
we insert the model patches in the second-to-last fully-connected layer and the first
two fully-connected layers after the feature embedding layer. In all comparisons,
we term MetaPatch as DCCL-e, and MoMoDistill as DCCL-m, since the whole
framework resembles EM iterations. The default method compared against the baselines
is named DCCL, which indicates that it goes through both on-device personalization
and the “model-over-models” distillation. The performance is measured by
HitRate, NDCG and macro-AUC.
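
For the leave-one-out protocol above, HitRate@K and NDCG@K reduce to simple functions of the rank of the held-out item among the sampled candidates. The sketch below uses the standard single-relevant-item definitions; the tie-handling rule (ties count against the held-out item) and the function names are assumptions.

```python
import math


def rank_of_positive(pos_score, neg_scores):
    """0-based rank of the held-out item; ties are counted against it."""
    return sum(1 for s in neg_scores if s >= pos_score)


def hit_rate_at_k(rank, k=10):
    """1 if the held-out item appears in the top-k list, else 0."""
    return 1.0 if rank < k else 0.0


def ndcg_at_k(rank, k=10):
    """With one relevant item the ideal DCG is 1, so NDCG = 1 / log2(rank + 2)."""
    return 1.0 / math.log2(rank + 2) if rank < k else 0.0


# One candidate scores higher than the held-out item, so its rank is 1.
rank = rank_of_positive(0.9, [0.95, 0.5, 0.1])
hr, ndcg = hit_rate_at_k(rank), ndcg_at_k(rank)
```

Averaging these per-user values over all test users gives the reported HitRate@K and NDCG@K.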


19.3.2.1 How is the performance of DCCL compared with the SOTAs? 


To demonstrate the effectiveness of DCCL, we conduct experiments on Amazon,
Movielens and Taobao to compare against a range of baselines. In line with
popular experimental settings (He et al, 2017b; Zhou et al, 2018b), the last
interacted item of each user on the three datasets is left for evaluation and all
items before the last one are used for training. For DCCL, we split the training data
into two equal parts according to the temporal order: one part is for the pretraining
of the backbone (DIN) and the other part is for the training of DCCL. In the
experiments, we conduct one round of DCCL-e and DCCL-m. Finally, DCCL-m is
used to compare with the six representative models. We find that the deep
learning-based methods NeuMF and DeepFM usually outperform the conventional methods
MF and FM, and the sequence-based methods SASRec and DIN consistently
outperform the previous non-sequence-based methods. Our DCCL builds upon the best
baseline DIN and further improves its results. Specifically, DCCL shows about 2%
or more improvement in terms of NDCG@10, and at least 1% improvement in
terms of HitRate@10 on all three datasets. The performance on both small and
large datasets confirms the superiority of our DCCL.


19.3.2.2 Does on-device personalization benefit the cloud model?


In this section, we aim to demonstrate how on-device personalization via
MetaPatch (abbreviated as DCCL-e) can improve the recommendation performance
for different levels of users compared with the centralized cloud model.
Considering the data scale and the availability of the context information for
visualization, only the Taobao dataset is used to conduct this experiment. To validate
the performance of DCCL-e at a fine granularity, we sort the users based on
their sample numbers and then partition them into 20 equal-sized groups along the
sorted user axis (see the statistic of the sample number w.r.t. the user in the
appendix). After on-device model personalization, we calculate the performance for
each group based on the personalized models. Here, the macro-AUC metric is used,
which treats the users in the group equally, instead of the group AUC in (Zhou et al,
2018b).
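
The grouping and the per-group macro-AUC can be sketched as follows. The sketch assumes users are sorted by sample count in descending order, so that larger group indices hold the long-tailed users, and that users with only one label class are skipped because their AUC is undefined; both choices and the function names are assumptions.

```python
import math


def auc(labels, scores):
    """Pairwise AUC: chance that a positive outranks a negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return None  # undefined when only one class is present
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def partition_users(sample_counts, num_groups=20):
    """Sort users by sample count (descending) and cut into equal-sized groups."""
    users = sorted(sample_counts, key=lambda user: -sample_counts[user])
    size = math.ceil(len(users) / num_groups)
    return [users[g * size:(g + 1) * size] for g in range(num_groups)]


def macro_auc(group, per_user_data):
    """Average of per-user AUCs, so every user in the group counts equally."""
    aucs = [auc(*per_user_data[user]) for user in group]
    aucs = [a for a in aucs if a is not None]
    return sum(aucs) / len(aucs) if aucs else None


counts = {"a": 5, "b": 1, "c": 3, "d": 2}
groups = partition_users(counts, num_groups=2)
data = {"a": ([1, 0], [0.9, 0.1]), "c": ([1, 0], [0.2, 0.8])}
head_group_auc = macro_auc(groups[0], data)
```

Averaging per-user AUCs keeps heavy users from dominating a group's score, which matters when the goal is to see how long-tailed users fare.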

We use DIN as the baseline and pretrain it on the first 20 days of the Taobao
dataset. Then, we test the model on the data of the remaining 10 days. For DCCL-e,
we first pretrain DIN on the first 10 days of the Taobao dataset, and then insert
the patches into the pretrained DIN as in the previous settings. Finally, we perform
the on-device personalization in the subsequent 10 days. Similarly, we test DCCL-e
on the data of the last 10 days. The evaluation is conducted separately in the 20
groups. According to the results, we find that the performance roughly decreases as
the group index increases. This is because the users in the groups with larger
indices are more like long-tailed users under our partition, and
their patterns are easily ignored or even sacrificed by the centralized cloud model.
In comparison, DCCL-e shows consistent improvement over DIN on all groups,
and achieves especially large improvements in the long-tailed user groups.


19.3.2.3 The iterative characteristics of the multi-round DCCL


To illustrate the convergence property of DCCL, we conduct experiments on the
Taobao dataset with different temporal intervals of device-cloud interaction.
Concretely, we let the device and the cloud interact every 2, 5, or 10 days, and
trace the performance of each round evaluated on the last click of each user.
According to the results, we observe that frequent interactions achieve much better
performance than the infrequent counterparts. We speculate that, as MetaPatch and
MoMoDistill promote each other at every round, the advantages in performance
are continuously strengthened with more frequent interactions. However,
the side effect is that we have to update the on-device models frequently, which
may introduce other uncertain crash risks. Thus, in real-world scenarios, we
need to make a trade-off between performance and the interaction interval.


19.3.2.4 Ablation Study of DCCL 


For the first study, we give the results of the one-round DCCL on the Taobao
dataset and compare them with DIN. From the results, we can observe progressive
improvement after DCCL-e and DCCL-m, and DCCL-m brings a larger gain than
DCCL-e. The benefit of DCCL-e is that MetaPatch customizes a personalized model
for each user to improve their recommendation experience once new behavior logs
are collected on the device, without waiting for the delayed update
from the centralized cloud server. The further improvements from DCCL-m confirm
the necessity of MoMoDistill to re-calibrate the backbone and the parameter basis
in the long term. If we conduct the experiments without our two modules,
the model reduces to DIN, which performs worse than DCCL.

For the second ablation study, we explore the effect of the model patches at
different layer junctions. In previous sections, we insert two patches (1st Junction,
2nd Junction) in the two fully-connected layers respectively after the feature
embedding layer, and one patch (3rd Junction) in the layer before the last softmax
transformation layer. In this experiment, we validate their effectiveness by keeping
only one of them at a time in one-round DCCL. Compared with the full model, we find
that removing model patches decreases the performance. The results suggest that the
patches in the 1st and 2nd junctions are more effective than the one in the 3rd
junction.


19.4 Future Directions 


Certainly, we have witnessed the arising trends for GNNs to be applied in various 
areas. We believe the following directions should be paid more attention for GNNs 
to have wider impacts in big data areas, especially in search, recommendation or 
advertisement. 


e There is still a lot to understand about GNNs, but there were quite a few im- 
portant results about how they work (Loukas, 2020; Xu et al, 2019d; Oono and 
Suzuki, 2020). Future research works of GNNs should balance between techni- 
cal simplicity, high practical impact, and far-reaching theoretical insights. 

e It is also great to see how GNNs can be applied for other real-world tasks (Wei 
et al, 2019; Wang et al, 2019a; Paliwal et al, 2020; Shi et al, 2019a; Jiang and 
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Balaprakash, 2020; Chen et al, 20200). For example, we see applications in fix- 
ing bugs in Javascript, game playing, answering IQ-like tests, optimization of 
TensorFlow computational graphs, molecule generation, and question genera- 
tion in dialogue systems, among many others. 

It will become popular to see GNNs applied for knowledge graph reasoning 
(Ren et al, 2020; Ye et al, 2019b). A knowledge graph is a structured way to 
represent facts where nodes and edges actually bear some semantic meaning, 
such as the name of the actor or act of playing in movies. 

• Recently, new perspectives have emerged on how we should approach learning
graph representations, especially regarding the balance between local and
global information. For example, Deng et al (2020) present a way to improve
the running time and accuracy of node classification for any unsupervised
embedding method. Chen et al (2019c) show that if one replaces a non-linear
neighborhood aggregation function with its linear counterpart, which includes
the degrees of the neighbors and the propagated graph attributes, then the
performance of the model does not decrease. This is consistent with earlier
observations that many graph data sets are trivial for classification, and it
raises the question of what a proper validation framework for this task
should be.

• Algorithmic work on GNNs should be integrated more closely with system
design, so that end-to-end solutions let users address their scenarios by
bringing graphs into deep learning frameworks. Such a system should allow
pluggable operators, to keep pace with the fast development of the GNN
community, and should excel at graph building and sampling. As an independent
and portable system, AliGraph (Zhu et al, 2019c) exposes interfaces that can
be integrated with any tensor engine used for expressing neural network
models. By co-designing flexible Gremlin-like interfaces for both graph query
and sampling, it lets users customize data access patterns freely. Moreover,
AliGraph also shows excellent performance and scalability.


Editor’s Notes: Recommender systems are one of the hottest topics in both the
research and industrial communities because of their huge value to commercial
businesses such as Amazon, Facebook, LinkedIn, and so on. Since user-item
interactions, user-user interactions, and item-item similarities can naturally
be formulated as graph-structured data, various graph representation learning
techniques (GNN Methods in Chapter 4, GNN Scalability in Chapter 6, Graph
Structure Learning in Chapter 14, Dynamic GNNs in Chapter 15, and
Heterogeneous GNNs in Chapter 16) can serve as a strong set of algorithmic
foundations for applying GNNs to build effective and efficient modern
recommendation systems.

20 Graph Neural Networks in Computer Vision

Abstract Recently, Graph Neural Networks (GNNs) have been incorporated into
many Computer Vision (CV) models. They not only bring performance improvements
to many CV-related tasks but also provide a more explainable decomposition of
these CV models. This chapter provides a comprehensive overview of how GNNs
are applied to various CV tasks, ranging from single image classification to
cross-media understanding. It also provides a discussion of this rapidly
growing field from a frontier perspective.

20.1 Introduction


Recent years have seen great success of Convolutional Neural Networks (CNNs)
in Computer Vision (CV). However, most of these methods lack a fine-grained
analysis of the relationships among visual data (e.g., related visual regions,
adjacent video frames). For example, an image can be represented as a spatial
map, yet the regions in an image are often spatially and semantically
dependent. Similarly, a video can be represented as a spatio-temporal graph,
where each node represents a region of interest in the video and the edges
capture relationships between such regions. These edges describe the relations
and capture the interdependence between nodes in the visual data. Such
fine-grained dependencies are critical to perceiving, understanding, and
reasoning about visual data. Therefore, graph neural networks can naturally be
used to extract patterns from these graphs to facilitate the corresponding
computer vision tasks.

This chapter introduces graph neural network models for various computer
vision tasks, including specific tasks for image, video, and cross-media
(cross-modal) data (Zhuang et al, 2017). For each task, this chapter
demonstrates, with representative algorithms, how graph neural networks can be
adapted to and improve these computer vision tasks.

Finally, to provide a frontier perspective, we also introduce some other
distinctive GNN modeling methods and application scenarios in the subfield.


20.2 Representing Vision as Graphs 


In this section, we introduce the representation of the visual graph
$\mathcal{G} = \{\mathcal{V}, \mathcal{E}\}$. We focus on how to construct the
nodes $\mathcal{V} = \{v_1, v_2, \ldots, v_N\}$ and the edges (or relations)
$\mathcal{E} = \{e_1, e_2, \ldots, e_M\}$ in the visual graph.


20.2.1 Visual Node representation 


Nodes are essential entities in a graph. There are three kinds of methods to
represent the nodes of an image $X \in \mathbb{R}^{h \times w \times c}$ or a
video $X \in \mathbb{R}^{f \times h \times w \times c}$, where $(h, w)$ is the
resolution of the original image, $c$ is the number of channels, and $f$ is
the number of frames.

Firstly, it is possible to split the image or each video frame into regular
grids, as in Fig. 20.1, each of which is an image patch of resolution $(p, p)$
(Dosovitskiy et al, 2021; Han et al, 2020). Each grid cell then serves as a
vertex of the visual graph, and a neural network is applied to obtain its
embedding.
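
The grid-based node construction can be illustrated with a minimal sketch. It assumes an image stored as nested Python lists of shape h x w x c and a hypothetical helper `split_into_patches`. The embedding network that would follow is omitted, so this only shows how each (p, p) patch becomes one vertex.

```python
def split_into_patches(image, p):
    """Split an h x w x c image (nested lists) into non-overlapping p x p patches.

    Returns the patches in row-major order; each patch is flattened into a
    vector of length p * p * c that an embedding layer could consume.
    """
    h, w = len(image), len(image[0])
    if h % p or w % p:
        raise ValueError("image size must be divisible by the patch size")
    patches = []
    for top in range(0, h, p):
        for left in range(0, w, p):
            flat = []
            for i in range(top, top + p):
                for j in range(left, left + p):
                    flat.extend(image[i][j])  # append the c channel values
            patches.append(flat)
    return patches


# A 4 x 4 single-channel image whose pixel value equals its row-major index.
image = [[[4 * i + j] for j in range(4)] for i in range(4)]
nodes = split_into_patches(image, 2)  # four vertexes, one per 2 x 2 patch
```

Each entry of `nodes` corresponds to one vertex of the visual graph. In practice the flattened patch would be passed through a learned projection before any graph layer is applied.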

Fig. 20.1: Split an image into fixed-size patches and view them as vertexes

Secondly, some pre-processed structures like those in Fig. 20.2 can be
directly borrowed for vertex representation. For example, with an object
detection framework like Faster R-CNN (Ren et al, 2015) or YOLO (Heimer et al,
2019), the visual regions in the first column of the figure have been
extracted and can be thought of as vertexes in the graph. We map different
regions to features of the same dimension and feed them to the next training
step. As in the middle column of the figure, scene graph generation models
(Xu et al, 2017a; Li et al, 20191) not only perform visual detection but also
aim to parse an image into a semantic graph consisting of objects and their
semantic relationships, from which vertexes and edges can be readily obtained
for downstream image or video tasks. In the last column, human joints linked
by skeletons naturally form a graph, which can be used to learn human action
patterns (Jain et al, 2016b; Yan et al, 2018a).


Fig. 20.2: Pre-processed visual graph examples (columns, left to right: object
detection, scene graph generation, skeleton)


Finally, some works utilize semantic information to represent visual vertexes.
Li and Gupta (2018) assign pixels with similar features to the same vertex;
the assignment is soft and tends to group pixels into coherent regions. Pixel
features in a group are further aggregated to form a single vertex feature, as
shown in Fig. 20.3. Using convolutions to learn densely distributed, low-level
patterns, Wu et al (2020a) process the input image with several convolution
blocks and treat the features from the various filters as vertexes to learn
more sparsely distributed, higher-order semantic concepts. A point cloud is a
set of 3D points recorded by LiDAR scans. Te et al (2018) and Landrieu and
Simonovsky (2018) aggregate k-nearest neighbors to form superpoints (or
vertexes) and build their relations with ConvGNNs to explore the topological
structure and ‘see’ the surrounding environment.
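
The soft grouping of pixels into vertexes can be sketched as follows. This is an illustration only, not the exact formulation of Li and Gupta (2018): it assumes each vertex has an anchor vector in feature space and that a pixel's assignment is a softmax over negative squared distances to the anchors. The names `soft_assign` and `vertex_features` are hypothetical.

```python
import math


def soft_assign(pixels, anchors):
    """Return Q, where Q[p][v] is the soft assignment of pixel p to vertex v.

    Assumption: assignment = softmax over vertexes of the negative squared
    distance between the pixel feature and the vertex anchor.
    """
    Q = []
    for x in pixels:
        logits = [-sum((a - b) ** 2 for a, b in zip(x, c)) for c in anchors]
        m = max(logits)  # subtract the maximum for numerical stability
        e = [math.exp(v - m) for v in logits]
        total = sum(e)
        Q.append([v / total for v in e])
    return Q


def vertex_features(pixels, Q):
    """Aggregate pixel features into one feature per vertex (weighted mean)."""
    n_vertices, d = len(Q[0]), len(pixels[0])
    feats = []
    for v in range(n_vertices):
        mass = sum(q[v] for q in Q)
        feats.append(
            [sum(q[v] * x[c] for q, x in zip(Q, pixels)) / mass for c in range(d)]
        )
    return feats


# Two clusters of pixel features and one anchor near each cluster.
pixels = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]]
anchors = [[0.0, 0.0], [5.0, 5.0]]
Q = soft_assign(pixels, anchors)
feats = vertex_features(pixels, Q)
```

Because the assignment is soft, every pixel contributes to every vertex, but pixels far from an anchor contribute almost nothing, so each vertex feature ends up close to the mean of its own cluster.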
Fig. 20.3: Grouping similar pixels as vertexes (different colors)


20.2.2 Visual Edge representation 


Edges depict the relations between nodes and play an important role in graph
neural networks. For a 2D image, the nodes can be linked by different spatial
relations. A video clip, which stacks continuous frames, adds temporal
relations between frames to the spatial relations within each frame. On the
one hand, these relations can be fixed by predefined rules before training
GNNs, in which case they are referred to as static relations. On the other
hand, learning the relations themselves (thought of as dynamic relations) is
attracting more and more attention.


20.2.2.1 Spatial Edges 


Capturing spatial relations is the key step for images and videos. For static
methods, scene graph generation (Xu et al, 2017a) and human skeletons (Jain
et al, 2016b) provide natural choices of edges between nodes in the visual
graph, as described in Fig. 20.2. Recently, some works (Bajaj et al, 2019; Liu
et al, 2020g) use a fully-connected graph (every vertex is linked with all the
others) to model the relations among visual nodes and compute the union region
of each pair of nodes to represent edge features. Furthermore, self-attention
mechanisms (Yun et al, 2019; Yang et al, 2019f), whose main idea is inspired
by the Transformer (Vaswani et al, 2017) in NLP, are introduced to learn the
relations among visual nodes. Once the edges are represented, we can choose
either spectral-based or spatial-based GNNs for applications (Zhou et al,
2018c; Wu et al, 2021d).


Fig. 20.4: A spatial-temporal graph by extracting nodes from each frame and allow- 
ing directed edges between nodes in neighbouring frames 
20.2.2.2 Temporal Edges


To understand a video, the model not only builds spatial relations within a
frame but also captures temporal connections among frames. A series of methods
(Yuan et al, 2017; Shen et al, 2020; Zhang et al, 2020h) relate each node in
the current frame to nodes in nearby frames using semantic similarity methods
such as k-Nearest Neighbors to construct temporal relations among frames. In
particular, as shown in Fig. 20.4, Jabri et al (2020) represent a video as a
graph using a Markov chain and learn a random walk among nodes by dynamic
adjustment, where nodes are image patches and edges are affinities (in some
feature space) between nodes of neighboring frames. Zhang et al (2020g) use
regions as visual vertexes and evaluate the IoU (Intersection over Union) of
nodes across frames to represent the edge weights.
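
As a small illustration of IoU-weighted temporal edges, the sketch below computes the IoU between every region in one frame and every region in the next. It assumes axis-aligned boxes given as (x1, y1, x2, y2). The function names are hypothetical and are not taken from Zhang et al (2020g).

```python
def box_iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def temporal_edges(boxes_t, boxes_next):
    """Edge weights from each region in frame t to each region in frame t + 1."""
    return [[box_iou(a, b) for b in boxes_next] for a in boxes_t]


# One region in frame t, two regions in frame t + 1.
weights = temporal_edges([(0, 0, 2, 2)], [(1, 1, 3, 3), (10, 10, 12, 12)])
```

Regions that do not overlap get a weight of zero, so the resulting temporal graph is sparse whenever objects move only a little between consecutive frames.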


20.3 Case Study 1: Image 


20.3.1 Object Detection 


Object detection is a fundamental and challenging problem in computer vision
that has received great and lasting attention in recent years. Given a natural
image, the object detection task seeks to locate visual object instances from
certain categories (e.g., humans, animals, or trees). Generally speaking,
object detection can be grouped into two categories (Liu et al, 2020b): 1)
generic object detection and 2) salient object detection. The first aims to
detect an unlimited number of object instances in a digital image and predict
their class attributes from some pre-defined categories. The goal of the
second is to detect the most salient instance. In recent years, deep
learning-based methods such as Faster R-CNN (Ren et al, 2015) and YOLO (Heimer
et al, 2019) have achieved tremendous success in this field. Most of the early
methods and their follow-ups (Ren et al, 2015; He et al, 2017a) adopt a region
selection module to extract region features and predict the active probability
for each candidate region. Although they have proven successful, they mostly
treat the recognition of each candidate region separately, which leads to
non-negligible performance drops in atypical and non-ideal situations, such as
heavily long-tailed data distributions and many easily confused categories (Xu
et al, 2019b). Graph neural networks (GNNs) are introduced to address this
challenge by explicitly modeling the correlations between regions and
leveraging them to achieve better performance. In this section, we present one
typical case, SGRN (Xu et al, 2019b), to discuss this promising direction.

SGRN can be divided into two modules: 1) a sparse graph learner, which learns
the graph structure explicitly during training, and 2) a spatial-aware graph
embedding module, which leverages the learned graph structure and obtains the
graph representation. To make this clear, we denote the graph as
$\mathcal{G}(\mathcal{V}, \mathcal{E})$, where $\mathcal{V}$ is the vertex set
and $\mathcal{E}$ is the edge set. The image is denoted as $\mathcal{I}$, and
we formulate the regions of a specific image $\mathcal{I}$ as
$R = \{f_i\}_{i=1}^{n_r}$ with $f_i \in \mathbb{R}^d$, where $d$ is the
dimension of the region features and $n_r$ is the number of regions. We
discuss these two parts and omit other details.

Unlike previous attempts in related fields that build a category-to-category
graph (Dai et al, 2017; Niepert et al, 2016), SGRN treats the candidate
regions $R$ as graph nodes $\mathcal{V}$ and constructs a dynamic graph
$\mathcal{G}$ on top of them. Technically, they project the region features
into a latent space by:

$$z_i = \phi(f_i) \tag{20.1}$$

where $\phi$ denotes two fully-connected layers with ReLU activation,
$z_i \in \mathbb{R}^{l}$, and $l$ is the latent dimension.

The region graph is constructed from the latent representation $z$ as follows:

$$S_{ij} = z_i z_j^{\top} \tag{20.2}$$

where $S \in \mathbb{R}^{n_r \times n_r}$. It is not appropriate to keep all
relations between region pairs, since there are many negative (i.e.,
background) samples among the region proposals, which may hurt the performance
of the downstream task. If we used the dense matrix $S$ as the graph adjacency
matrix, the graph would be fully connected, which leads to a computational
burden or a performance drop, since most existing GNN methods work worse on
fully-connected graphs (Sun et al, 2019). To solve this issue, SGRN adopts KNN
to make the graph sparse (Chen et al, 2020n,o). In other words, for each row
$S_i \in \mathbb{R}^{n_r}$ of the learned similarity matrix, they keep only
the $K$ nearest neighbors (including the node itself) together with the
associated similarity scores (i.e., they mask off the remaining similarity
scores). The learned graph adjacency is denoted as:

$$A = \mathrm{KNN}(S) \tag{20.3}$$
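
Equations (20.2) and (20.3) can be sketched in a few lines. The example assumes the latent vectors $z_i$ are already available (the projection $\phi$ of Eq. (20.1) is skipped) and reads "including itself" as always keeping the diagonal entry plus the K - 1 most similar other nodes. Both choices, and the function names, are assumptions made for illustration rather than details confirmed by the SGRN paper.

```python
def similarity(Z):
    """Dense similarity S[i][j] = z_i . z_j, as in Eq. (20.2)."""
    return [[sum(a * b for a, b in zip(zi, zj)) for zj in Z] for zi in Z]


def knn_sparsify(S, k):
    """Keep, per row, the node itself and its k - 1 most similar neighbors.

    All other similarity scores are masked to zero, as in Eq. (20.3).
    """
    A = []
    for i, row in enumerate(S):
        others = sorted(
            (j for j in range(len(row)) if j != i),
            key=lambda j: row[j],
            reverse=True,
        )
        keep = {i, *others[: k - 1]}
        A.append([row[j] if j in keep else 0.0 for j in range(len(row))])
    return A


# Three regions with 2-dimensional latent vectors (toy values).
Z = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
S = similarity(Z)
A = knn_sparsify(S, k=2)
```

With `k=2`, each row of `A` retains its own score and that of its single closest neighbor. Note that the masked matrix is in general no longer symmetric, because nearest-neighbor relations are not mutual.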


The node’s initial embedding is obtained by the pre-trained visual classifier.
We omit the details and simply denote it as $X = \{x_i\}_{i=1}^{n_r}$. SGRN
introduces a spatial-aware graph reasoning module to learn the spatial-aware
node embedding. Formally, they introduce a patch operator adapted from the
graph convolutional network (GCN) with learnable Gaussian kernels, given by:

$$f_k(i) = \sum_{j \in \mathcal{N}(i)} \omega_k(\mu(i,j))\, x_j A_{ij} \tag{20.4}$$

where $\mathcal{N}(i)$ denotes the neighborhood of node $i$, $\mu(i,j)$ is the
distance between nodes $i$ and $j$ calculated from their centers in a polar
coordinate system, and $\omega_k(\cdot)$ is the $k$-th Gaussian kernel. Then
the results of the $K$ kernels are concatenated together and projected to the
latent space as follows:

$$h_i = g([f_1(i); f_2(i); \ldots; f_K(i)]) \tag{20.5}$$

where $g(\cdot)$ denotes the projection with non-linearity. Finally, $h_i$ is
combined with the original visual region feature $f_i$ to enhance
classification and regression performance.
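
A single kernel of the patch operator in Eq. (20.4) might be sketched as below. The text only states that the kernels are Gaussian and act on polar offsets between region centers, so the diagonal-covariance form used here, the neglect of angular wrap-around, and the names `polar_offset`, `gaussian_weight` and `kernel_aggregate` are all assumptions. In a real model `mean` and `sigma` would be learned.

```python
import math


def polar_offset(ci, cj):
    """Offset from center ci to center cj as (radius, angle)."""
    dx, dy = cj[0] - ci[0], cj[1] - ci[1]
    return math.hypot(dx, dy), math.atan2(dy, dx)


def gaussian_weight(u, mean, sigma):
    """Assumed kernel: Gaussian with diagonal covariance (no angle wrap-around)."""
    return math.exp(-0.5 * sum(((a - m) / s) ** 2 for a, m, s in zip(u, mean, sigma)))


def kernel_aggregate(X, A, centers, mean, sigma):
    """f_k(i) = sum over neighbors j of w_k(mu(i, j)) * x_j * A[i][j]."""
    n, d = len(X), len(X[0])
    out = [[0.0] * d for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if A[i][j] == 0.0:  # j is not a neighbor of i
                continue
            w = gaussian_weight(polar_offset(centers[i], centers[j]), mean, sigma)
            for c in range(d):
                out[i][c] += w * A[i][j] * X[j][c]
    return out


# Two regions; the kernel is centered on "one unit to the right" (r=1, angle=0).
X = [[1.0, 2.0], [3.0, 4.0]]
A = [[0.0, 1.0], [1.0, 0.0]]
centers = [(0.0, 0.0), (1.0, 0.0)]
out = kernel_aggregate(X, A, centers, mean=(1.0, 0.0), sigma=(1.0, 1.0))
```

Node 0 sees node 1 exactly at the kernel mean and receives its full feature, while node 1 sees node 0 in the opposite direction and receives a strongly damped copy. Running K such kernels and concatenating the results gives the input to the projection in Eq. (20.5).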


20.3.2 Image Classification 


Inspired by the success of deep learning techniques, significant improvement
has been made in the image classification field, with models such as ResNet
(He et al, 2016a). However, CNN-based models are limited in modeling relations
between samples. Graph neural networks have been introduced to image
classification both to model fine-grained region correlations that enhance
classification performance (Hong et al, 2020a) and to combine labeled and
unlabeled image instances for semi-supervised image classification (Luo et al,
2016; Satorras and Estrach, 2018). In this section, we present a typical case
of semi-supervised image classification to show the effectiveness of GNNs.

We denote the data samples as $(x_i, y_i) \in \mathcal{T}$, where $x_i$ is the
image and $y_i \in \mathbb{R}^K$ is the image label. In the semi-supervised
setting, $\mathcal{T}$ is divided into a labeled part $\mathcal{T}_{labeled}$
and an unlabeled part $\mathcal{T}_{unlabeled}$. We assume that there are
$N_l$ labeled samples and $N_u$ unlabeled samples, respectively. The proposed
GNN is dynamic and multi-layer, which means that each layer learns the graph
topology from the previous layer’s node embedding and then learns the new
embedding on top of it. Thus, we denote the number of layers as $M$ and only
present the detailed graph construction and graph embedding techniques of
layer $k$. Technically, they construct the graph for the image set and
formulate the posterior prediction task as message passing with a graph neural
network. They cast the samples as a graph
$\mathcal{G}(\mathcal{V}, \mathcal{E})$, whose node set is the image set
consisting of both labeled and unlabeled data. The edge set $\mathcal{E}$ is
constructed during training.


First, they denote the initial node representation as
$X^{(0)} = \{x_i^{(0)}\}_{i=1}^{N_l + N_u}$, given by:

$$x_i^{(0)} = (\phi(x_i), h(y_i)) \tag{20.6}$$

where $\phi(\cdot)$ is the convolutional neural network and $h(\cdot)$ is the
one-hot label encoding. Note that for unlabeled data, they replace $h(\cdot)$
with the uniform distribution over the $K$-simplex.

Second, the graph topology is learned from the current layer’s node embedding,
denoted as $x^{(k)}$. The distance matrix modeling the distance in the
embedding space between nodes is denoted as $S$, given by:

$$S_{i,j} = \varphi(x_i^{(k)}, x_j^{(k)}) \tag{20.7}$$

where $\varphi$ is a parametrized symmetric function as follows:

$$\varphi(a, b) = \mathrm{MLP}(\mathrm{abs}(a - b)) \tag{20.8}$$

where $\mathrm{MLP}(\cdot)$ is a multilayer perceptron network and
$\mathrm{abs}(\cdot)$ is the absolute-value function. Then the adjacency
matrix $A$ is calculated by normalizing each row of $S$ with a softmax
operation.
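
The construction of the learned adjacency in Eqs. (20.7) and (20.8) can be sketched as follows. As a loud simplification, the MLP is replaced by a single linear layer with fixed weights `w` (a hypothetical stand-in, since the trained network is not available here). The row-wise softmax follows the text.

```python
import math


def pair_score(a, b, w, bias=0.0):
    """Stand-in for MLP(abs(a - b)): one linear layer applied to |a - b|."""
    return sum(wi * abs(x - y) for wi, x, y in zip(w, a, b)) + bias


def row_softmax(S):
    """Normalize each row of S with a softmax."""
    out = []
    for row in S:
        m = max(row)  # subtract the maximum for numerical stability
        e = [math.exp(v - m) for v in row]
        total = sum(e)
        out.append([v / total for v in e])
    return out


def learned_adjacency(X, w):
    """S[i][j] = phi(x_i, x_j), then A = row-wise softmax of S."""
    S = [[pair_score(a, b, w) for b in X] for a in X]
    return row_softmax(S)


# Negative weights make nearby embeddings score higher (toy choice).
X = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
A = learned_adjacency(X, w=[-1.0, -1.0])
```

Because the score depends only on `abs(a - b)`, it is symmetric in its two arguments, which matches the requirement stated for Eq. (20.7). The softmax then turns each row into a distribution over neighbors.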

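Then a GNN layer is adopted to encode the graph nodes with the learned
topology $A$. The GNN layer receives the node embedding matrix $x^{(k)}$ and
outputs the aggregated node representation $x^{(k+1)}$ as:

$$x_l^{(k+1)} = \rho\Big(\sum_{B \in \mathcal{A}} B\, x^{(k)} \theta_{B,l}^{(k)}\Big), \quad l = d_1, \ldots, d_{k+1} \tag{20.9}$$

where the sum runs over the set $\mathcal{A}$ of graph operators built from
the learned topology, $\{\theta_1^{(k)}, \ldots, \theta_{|\mathcal{A}|}^{(k)}\}$
are trainable parameters, and $\rho(\cdot)$ is a non-linear activation
function (leaky ReLU here).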
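
A minimal sketch of the layer in Eq. (20.9) is given below. It assumes the operator set contains the identity and one learned adjacency, and it uses fixed toy weight matrices. These choices, and the helper names, are illustrative assumptions rather than the trained model.

```python
def matmul(P, Q):
    """Plain-Python matrix product of two nested-list matrices."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*Q)] for row in P]


def gnn_layer(X, operators, thetas, slope=0.01):
    """Compute leaky_relu(sum over operators B of B @ X @ theta_B)."""
    n, d_out = len(X), len(thetas[0][0])
    total = [[0.0] * d_out for _ in range(n)]
    for B, theta in zip(operators, thetas):
        term = matmul(matmul(B, X), theta)
        for i in range(n):
            for c in range(d_out):
                total[i][c] += term[i][c]
    # Leaky ReLU applied element-wise.
    return [[v if v > 0 else slope * v for v in row] for row in total]


# Two nodes with 2-dimensional features; operators are identity and adjacency.
X = [[1.0, -1.0], [2.0, 0.0]]
I = [[1.0, 0.0], [0.0, 1.0]]
A = [[0.5, 0.5], [0.5, 0.5]]
T = [[1.0, 0.0], [0.0, 1.0]]  # toy weight matrix, one copy per operator
H = gnn_layer(X, [I, A], [T, T], slope=0.1)
```

Including the identity among the operators lets each node keep its own embedding alongside the aggregated neighbor information, and each operator gets its own weight matrix, as the per-operator parameters in Eq. (20.9) suggest.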

The graph neural network is effective in modeling correlations in unstructured
data. In this work, the GNN explicitly exploits the relations between samples,
especially between labeled and unlabeled data, which helps with few-shot image
classification.


20.4 Case Study 2: Video 


20.4.1 Video Action Recognition 


Action recognition in video is a highly active area of research and plays a
crucial role in video understanding. Given a video as input, the task of
action recognition is to recognize the action appearing in the video and
predict the action category. Over the past few years, modeling the
spatio-temporal nature of video has been at the core of research in video
understanding and action recognition. Early approaches to activity recognition
such as hand-crafted Improved Dense Trajectory (iDT) (Wang and Schmid, 2013),
Two-Stream ConvNets (Simonyan and Zisserman, 2014a), C3D (Tran et al, 2015),
and I3D (Carreira and Zisserman, 2017) have focused on using spatio-temporal
appearance features. To better model longer-term temporal information,
researchers also attempted to model the video as an ordered frame sequence
using Recurrent Neural Networks (RNNs) (Yue-Hei Ng et al, 2015; Donahue et al,
2015; Li et al, 2017b). However, these conventional deep learning approaches
only focus on extracting features from whole scenes and are unable to model
the relationships between different object instances in space and time. For
example, to recognize that the action in a video corresponds to “opening a
book”, the temporal dynamics of objects as well as human-object and
object-object interactions are crucial. We need to temporally link book
regions across time to capture the shape of the book and how it changes over
time.

To capture relations between objects across time, several deep models (Chen
et al, 2019d; Herzig et al, 2019; Wang and Gupta, 2018; Wang et al, 2018e)
have recently been introduced that represent the video as a spatial-temporal
graph and leverage recently proposed graph neural networks. These methods take
dense object proposals as graph nodes and learn the relations between them. In
this section, we take the framework proposed in (Wang and Gupta, 2018) as an
example to demonstrate how graph neural networks can be applied to the action
recognition task.

As illustrated in Fig. 20.5, the model takes a long clip of video frames as
input and forwards them to a 3D Convolutional Neural Network to get a feature
map $I \in \mathbb{R}^{t \times h \times w \times d}$, where $t$ represents
the temporal dimension, $h \times w$ represents the spatial dimensions, and
$d$ represents the number of channels. Then the model adopts the Region
Proposal Network (RPN) (Ren et al, 2015) to extract the object bounding boxes,
followed by RoIAlign (He et al, 2017a) to extract a $d$-dimensional feature
for each object proposal. The $n$ output object proposals, aggregated over the
$t$ frames, correspond to the nodes of the graphs to be built. There are
mainly two types of graphs: the Similarity Graph and the Spatial-Temporal
Graph.


Fig. 20.5: Overview of the GNN-based model for Video Action Recognition.


The similarity graph is constructed to measure the similarity between objects.
In this graph, pairs of semantically related objects are connected. Formally,
the pairwise similarity between every two nodes can be represented as:

$$F(x_i, x_j) = \phi(x_i)^{\top} \phi'(x_j) \tag{20.10}$$

where $\phi$ and $\phi'$ represent two different transformations of the
original features. After computing the similarity matrix, the normalized edge
values $A_{ij}^{sim}$ from node $i$ to node $j$ can be defined as:

$$A_{ij}^{sim} = \frac{\exp F(x_i, x_j)}{\sum_{j=1}^{n} \exp F(x_i, x_j)} \tag{20.11}$$

The spatial-temporal graph is proposed to encode the relative spatial and
temporal relations between objects, where objects in nearby locations in space
and time are connected together. The normalized edge values of the
spatial-temporal graph can be formulated as:

$$A_{ij}^{front} = \frac{\sigma_{ij}}{\sum_{j=1}^{n} \sigma_{ij}} \tag{20.12}$$

where $A^{front}$ represents the forward graph, which connects objects from
frame $t$ to frame $t + 1$, and $\sigma_{ij}$ represents the value of the
Intersection over Union (IoU) between object $i$ in frame $t$ and object $j$
in frame $t + 1$. The backward graph $A^{back}$ can be computed in a similar
way. Then, Graph Convolutional Networks (GCNs) (Kipf and Welling, 2017b) are
applied to update the features of each object node. One layer of graph
convolution can be represented as:

$$Z = AXW \tag{20.13}$$

where $A$ represents one of the adjacency matrices ($A^{sim}$, $A^{front}$, or
$A^{back}$), $X$ represents the node features, and $W$ is the weight matrix of
the GCN.

The updated node features after graph convolutions are forwarded to an average 
pooling layer to obtain the global graph representation. Then, the graph representa- 
tion and pooled video representation are concatenated together for video classifica- 
tion. 
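
The pipeline from adjacency construction to the pooled graph representation can be sketched end to end. The example assumes identity transformations in place of the two learned feature transformations, toy IoU values, an identity GCN weight, and a hypothetical pooled video feature `video_feature`. It is meant only to show how the similarity normalization, the IoU normalization, one graph convolution, average pooling and concatenation fit together.

```python
import math


def matmul(P, Q):
    """Plain-Python matrix product of two nested-list matrices."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*Q)] for row in P]


def softmax_rows(F):
    """Row-wise softmax, used to normalize the similarity graph."""
    out = []
    for row in F:
        m = max(row)
        e = [math.exp(v - m) for v in row]
        total = sum(e)
        out.append([v / total for v in e])
    return out


def normalize_rows(M):
    """Divide each row by its sum (rows with zero sum stay zero)."""
    out = []
    for row in M:
        total = sum(row)
        out.append([v / total if total > 0 else 0.0 for v in row])
    return out


def gcn_layer(A, X, W):
    """One graph convolution, Z = A X W."""
    return matmul(matmul(A, X), W)


def mean_pool(Z):
    """Average node features to obtain a graph-level representation."""
    return [sum(col) / len(Z) for col in zip(*Z)]


X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]          # three object nodes
F = matmul(X, [list(col) for col in zip(*X)])      # F = X X^T (identity transforms)
A_sim = softmax_rows(F)
iou = [[0.5, 0.5, 0.0], [0.0, 0.0, 0.0], [0.0, 0.25, 0.25]]  # toy IoU values
A_front = normalize_rows(iou)
W = [[1.0, 0.0], [0.0, 1.0]]                        # toy GCN weight
graph_feature = mean_pool(gcn_layer(A_sim, X, W))
video_feature = [0.3, 0.7]                          # hypothetical pooled video feature
clip_feature = graph_feature + video_feature        # concatenation for the classifier
```

Only the similarity graph is convolved here for brevity. The forward and backward graphs would be handled by the same `gcn_layer` call with `A_front` or its backward counterpart.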


20.4.2 Temporal Action Localization 


Temporal action localization is the task of training a model to predict the
boundaries and categories of action instances in untrimmed videos. Most
existing methods (Chao et al, 2018; Gao et al, 2017; Lin et al, 2017; Shou
et al, 2017, 2016; Zeng et al, 2019) tackle temporal action localization in a
two-stage pipeline: they first generate a set of 1D temporal proposals and
then perform classification and temporal boundary regression on each proposal
individually. However, these methods process each proposal separately and fail
to leverage the semantic relations between proposals. To model the
proposal-proposal relations in the video, graph neural networks are adopted to
facilitate the recognition of each proposal instance. P-GCN (Zeng et al, 2019)
is a recently proposed method that exploits the proposal-proposal relations
using Graph Convolutional Networks. P-GCN first constructs an action proposal
graph, where each proposal is represented as a node and the relation between
two proposals as an edge. Then P-GCN performs reasoning over the proposal
graph using GCN to model the relations among different proposals and update
their representations. Finally, the updated node representations are used to
refine the proposals' boundaries and classification scores based on the
established proposal-proposal dependencies.
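
The text above does not say how edges between proposals are chosen. As one plausible illustration, the sketch below connects two 1D temporal proposals when their temporal IoU exceeds a threshold. The threshold value, the (start, end) representation and the function names are assumptions made for this example.

```python
def temporal_iou(p, q):
    """IoU of two 1D proposals given as (start, end) in seconds."""
    inter = max(0.0, min(p[1], q[1]) - max(p[0], q[0]))
    union = (p[1] - p[0]) + (q[1] - q[0]) - inter
    return inter / union if union > 0 else 0.0


def build_proposal_graph(proposals, threshold=0.5):
    """Return undirected edges (i, j), i < j, between overlapping proposals."""
    n = len(proposals)
    return [
        (i, j)
        for i in range(n)
        for j in range(i + 1, n)
        if temporal_iou(proposals[i], proposals[j]) > threshold
    ]


# Two overlapping proposals and one isolated proposal.
proposals = [(0, 10), (2, 10), (20, 30)]
edges = build_proposal_graph(proposals)
```

The resulting edge list can be turned into an adjacency matrix and fed to a GCN layer so that each proposal's representation is updated with context from overlapping proposals before its boundaries and scores are refined.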

20.5 Other Related Work: Cross-media


Graph-structured data exists widely across different modalities (images,
videos, texts) and is used in existing cross-media tasks (e.g., visual
captioning, visual question answering, cross-media retrieval). In other words,
a sensible use of graph-structured data and GNNs can effectively improve the
performance of cross-media tasks.


20.5.1 Visual Caption 


Visual captioning aims at building a system that automatically generates a
natural language description of a given image or video. The problem of image
captioning is interesting not only because it has important practical
applications, such as helping visually impaired people see, but also because
it is regarded as a grand challenge for vision understanding. The typical
solutions to visual captioning are inspired by machine translation and are
equivalent to translating an image into text. In these methods (Li et al,
2017d; Lu et al, 2017a; Ding et al, 2019b), a Convolutional Neural Network
(CNN) or Region-based CNN (R-CNN) is usually used to encode an image, and a
Recurrent Neural Network (RNN) decoder, with or without an attention
mechanism, is used to generate the sentence. However, a common issue that has
not been fully studied is how visual relationships should be leveraged, given
that the mutual correlations or interactions between objects are the natural
basis for describing an image.

Fig. 20.6: Framework of GCN-LSTM.


In recent years, Yao et al (2018) presented the Graph Convolutional Networks
plus Long Short-Term Memory (GCN-LSTM) architecture, which explores visual
relationships for boosting image captioning. As shown in Fig. 20.6, they study
the problem from the viewpoint of modeling mutual interactions between
objects/regions to enrich the region-level representations that are fed into
the sentence decoder. Specifically, they build two kinds of visual
relationships, i.e., semantic and spatial correlations, on the detected
regions, and devise graph convolutions on the region-level representations
with visual relationships to learn more powerful representations. Such
relation-aware region-level representations are then input into an attention
LSTM for sentence generation.

Then, Yang et al (2019g) presented a novel Scene Graph Auto-Encoder (SGAE) for
image captioning. This captioning pipeline contains two steps: 1) extracting
the scene graph for an image, using GCN to encode the corresponding scene
graph, and then decoding the sentence from the recoded representation; 2)
incorporating the image scene graph into the captioning model. They also use
GCNs to encode the visual scene graph. Given the representation of the visual
scene graph, they introduce a joint visual and language memory to choose an
appropriate representation for generating the image description.


20.5.2 Visual Question Answering 


Visual Question Answering (VQA) aims at building a system that automatically an- 
swers natural language questions about visual information. It is a challenging task 
that involves mutual understanding and reasoning across different modalities. In the 
past few years, benefiting from the rapid developments of deep learning, the pre- 
vailing image and video question methods (Shah et al, 2019; Zhang et al, 2019g; Yu 
et al, 2017a) prefer to represent the visual and linguistic modalities in a common la- 
tent subspace, use the encoder-decoder framework and attention mechanism, which 

Fig. 20.7: GNN-based Visual QA.


However, the aforementioned methods have not considered graph information in
the VQA task. Recently, Zhang et al (2019a) investigate an alternative
approach inspired by conventional QA systems that operate on knowledge graphs.
Specifically, as shown in Fig. 20.7, they investigate the use of scene graphs
derived from images, which naturally encode information as graphs and support
structured reasoning for Visual QA. The experimental results demonstrate that
scene graphs, even when automatically generated by machines, can clearly
benefit Visual QA if paired with appropriate models such as GNNs. In other
words, leveraging scene graphs largely increases Visual QA accuracy on
questions related to counting, object presence and attributes, and
multi-object relationships.

Another work (Li et al, 2019d) presents the Relation-aware Graph Attention
Network (ReGAT), a novel framework for VQA that models multi-type object
relations with a question-adaptive attention mechanism. A Faster R-CNN is used
to generate a
set of object region proposals, and a question encoder is used for question embed- 
ding. The convolutional and bounding-box features of each region are then injected 
into the relation encoder to learn the relation-aware, question-adaptive, region-level 
representations from the image. These relation-aware visual features and the ques- 
tion embeddings are then fed into a multimodal fusion module to produce a joint 
representation, which is used in the answer prediction module to generate an an- 
swer. 


20.5.3 Cross-Media Retrieval 


Image-text retrieval has become a popular cross-media research topic in recent
years. It aims to retrieve the most similar samples from a database in another
modality. The key challenge here is how to match cross-modal data by
understanding their contents and measuring their semantic similarity. Many
approaches (Faghri et al, 2017; Gu et al, 2018; Huang et al, 2017b) have been
proposed. They often use global or local representations to express the whole
image and sentence. Then, a metric is devised to measure the similarity of a
pair of features from different modalities. However, these methods overlook
the relationships between objects in multi-modal data, which are also key to
image-text retrieval.


Fig. 20.8 (panel labels: Text Input, Graph Structure, Graph Conv, Image Input,
Hand-crafted features, Feature Vector, Similarity estimation)

To make better use of the graph data in images and text, as shown in Fig.
20.8, Yu et al (2018b) propose a novel cross-modal retrieval model named
dual-path neural network with graph convolutional network. This network takes
both irregular graph-structured textual representations and regular
vector-structured visual representations into consideration to jointly learn
coupled features and a common latent semantic space.

In addition, Wang et al (2020i) extract objects and relationships from the
image and the text to form a visual scene graph and a text scene graph, and
design a so-called Scene Graph Matching (SGM) model, in which two tailored
graph encoders encode the visual scene graph and the text scene graph into
feature graphs. After that, both object-level and relationship-level features
are learned in each graph, so that the two feature graphs corresponding to the
two modalities can finally be matched at two levels more plausibly.


20.6 Frontiers for Graph Neural Networks on Computer Vision 


In this section, we introduce the frontiers for GNNs on Computer Vision. We focus 
on the advanced modeling methods of GNN for Computer Vision and their applica- 
tions in a broader area of the subfield. 


20.6.1 Advanced Graph Neural Networks for Computer Vision 


The main idea of GNN modeling in CV is to represent visual information as a
graph. It is common to represent pixels, object bounding boxes, or image
frames as nodes and then build a homogeneous graph to model their relations.
Beyond this kind of method, there are also some new ideas for GNN modeling.

Considering the specific task nature, some works try to represent different forms 
of visual information in the graph. 


• Person Feature Patches: Yan et al (2019), Yang et al (2020b), and Yan et al
(2020b) build spatial and temporal graphs for person re-identification
(Re-ID). They horizontally partition each person feature map into patches and
use the patches as the nodes of the graph. GCN is further used to model the
relations of body parts across frames.

• Irregular Clustering Regions: Liu et al (2020h) introduce the bipartite GNN
for mammogram mass detection. It first leverages KNN forward mapping to
partition an image feature map into irregular regions. Then the features in an
irregular region are further integrated as a node. The bipartite node sets are
constructed from the respective cross-view images, while the bipartite edges
learn to model both inherent cross-view geometric constraints and appearance
similarities.
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NAS Cells Lin et al (2020c) proposed graph-guided Neural Architecture 
Search (NAS) algorithms. The proposed models represent an operation cell as 
a node and apply the GCNs to model the relationship of cells in network archi- 
tecture search. 
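The patch-as-node idea in the first item can be sketched in a few lines. This is a minimal illustration rather than the construction used in the cited papers: the stripe count, the average pooling, and the cosine-similarity adjacency are all assumptions made for the example, and the function names are hypothetical.

```python
import numpy as np

def patch_nodes(feature_map: np.ndarray, num_patches: int) -> np.ndarray:
    """Split a (C, H, W) feature map into horizontal stripes and
    average-pool each stripe into one node feature vector of size C."""
    stripes = np.array_split(feature_map, num_patches, axis=1)  # split along H
    return np.stack([s.mean(axis=(1, 2)) for s in stripes])     # (num_patches, C)

def similarity_adjacency(nodes: np.ndarray) -> np.ndarray:
    """Row-normalised adjacency from pairwise cosine similarity (assumed choice)."""
    unit = nodes / (np.linalg.norm(nodes, axis=1, keepdims=True) + 1e-8)
    sim = np.clip(unit @ unit.T, 0.0, None)
    return sim / sim.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
frames = rng.random((2, 8, 12, 4))  # 2 frames, each with C=8, H=12, W=4
# Patches from all frames become nodes of one spatial-temporal graph.
nodes = np.concatenate([patch_nodes(f, 3) for f in frames])  # (6, 8)
adj = similarity_adjacency(nodes)                             # (6, 6)
```

A GCN layer would then propagate the node features over `adj`, so that each body-part patch can exchange information with patches from the same and from other frames.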


20.6.2 Broader Area of Graph Neural Networks on Computer Vision


In this subsection, we introduce some other application scenarios of GNNs on CV, 
including but not limited to the following: 


• Point Cloud Analysis Point cloud analysis aims to recognize a set of points
in a coordinate system. Each point is represented by its three coordinates,
possibly with some other features. In order to utilize CNNs, early works
(Chen et al, 2017; Yan et al, 2018b; Yang et al, 2018a; Zhou and Tuzel, 2018)
convert a point cloud to a regular grid such as an image or voxels. Recently,
a series of works (Chen et al, 2020g; Lin et al, 2020f; Xu et al, 2020e; Shi
and Rajkumar, 2020; Shu et al, 2019) use a graph representation to preserve
the irregularity of a point cloud. Here GCN plays a role similar to that of
CNN in image processing, namely aggregating local information. Chen et al
(2020g) develop a hierarchical graph network structure for 3D object detection
on point clouds. Lin et al (2020f) propose a learnable GCN kernel and a 3D
graph max pooling with a receptive field of K nearest neighboring nodes. Xu
et al (2020e) propose a Coverage-Aware Grid Query and a Grid Context
Aggregation to accelerate 3D scene segmentation. Shi and Rajkumar (2020)
design a Point-GNN with an auto-registration mechanism to detect multiple
objects in a single shot.

• Low Resource Learning Low-resource learning concerns the ability to learn
from a very small amount of data or to transfer from prior knowledge. Some
works leverage GNNs to incorporate structural information for low-resource
image classification. Wang et al (2018f); Kampffmeyer et al (2019) use
knowledge graphs as extra information to guide zero-shot image classification.
Each node corresponds to an object category, and the word embeddings of the
nodes are taken as input for predicting the classifiers of the different
categories. Besides the knowledge graph, the similarity between images in the
dataset is also helpful for few-shot learning. Garcia and Bruna (2017); Liu
et al (2018e); Kim et al (2019) set up similarity metrics and further model
the few-shot learning problem as a label propagation or edge-labeling problem.

• Face Recognition Wang et al (2019p) formulate the face clustering task as
a link prediction problem. They utilize GCN to infer the likelihood of linkage
between pairs in the face sub-graphs. Yang et al (2019d) propose a
proposal-detection-segmentation framework for face clustering on an affinity
graph. Zhang et al (2020b) propose a global-local GCN to perform label
cleansing for face recognition.

• Miscellaneous We also introduce some distinctive GNN applications in the
subfield. Wei et al (2020) propose a view-GCN to recognize 3D shapes
through their projected 2D images. Wald et al (2020) extend the concept of the
scene graph to 3D indoor scenes. Ulutan et al (2020) leverage GCNs to reason
about the interactions between humans and objects. Cucurull et al (2019)
predict fashion compatibility between two items by formulating an edge
prediction problem. Sun et al (2020b) build a social behavior graph from a
video and use GNNs to propagate social interaction information for trajectory
prediction. Zhang et al (2020l) build a vision and language relation graph to
alleviate the hallucination problem in the grounded video description task.
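To make the point cloud item above concrete, the following sketch builds a K-nearest-neighbor graph directly on raw 3D points and aggregates neighbor features with a max operation. It is only an illustration under stated assumptions: the brute-force distance computation, the toy coordinates, and the function names are hypothetical, and none of the cited models is reproduced here.

```python
import numpy as np

def knn_graph(points: np.ndarray, k: int) -> np.ndarray:
    """Return an (N, k) array whose row i holds the indices of the
    k nearest neighbours of point i (brute force, self excluded)."""
    diff = points[:, None, :] - points[None, :, :]
    dist = np.einsum("ijk,ijk->ij", diff, diff)  # squared Euclidean distances
    np.fill_diagonal(dist, np.inf)               # a point is not its own neighbour
    return np.argsort(dist, axis=1)[:, :k]

def max_aggregate(features: np.ndarray, neighbours: np.ndarray) -> np.ndarray:
    """Max-pool the features of each node's neighbours: (N, F) -> (N, F)."""
    return features[neighbours].max(axis=1)

points = np.array(
    [[0, 0, 0], [1, 0, 0], [0, 1, 0], [5, 5, 5], [5, 6, 5]], dtype=float
)
nbrs = knn_graph(points, k=2)
agg = max_aggregate(points, nbrs)  # the coordinates double as node features here
```

Because the graph is built on the points themselves, no resampling to a pixel or voxel grid is needed, which is the irregularity-preserving property mentioned above.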


20.7 Summary 


This chapter shows that GNNs are a promising and fast-developing research
direction that offers exciting opportunities for computer vision. Nevertheless,
challenges remain. For example, graphs are often related to real scenarios, yet
the aforementioned GNNs lack interpretability, which matters especially for
decision-making problems (e.g., medical diagnostic models) in the computer
vision field. Moreover, compared to other black-box models (e.g., CNN),
interpretability for graph-based deep learning is even more challenging since
graph nodes and edges are often heavily interconnected. Thus, a direction worth
exploring further is how to improve the interpretability and robustness of
GNNs for computer vision tasks.


Editor’s Notes: Convolutional Neural Networks have achieved huge success
in the computer vision domain. However, recent years have seen the rise of
relational machine learning models such as GNNs and Transformers for modeling
more fine-grained correlations in both images and videos. Certainly, the graph
structure learning techniques in Chapter 14 become very important for
constructing an optimized graph from an image or a video and learning node
representations on this learnt implicit graph. Dynamic GNNs in Chapter
15 will play an important role when coping with a video. GNN Methods
in Chapter 4 and GNN Scalability in Chapter 6 are another two basic
building blocks for the use of GNNs for CV. This chapter is also highly
correlated with Chapter 21 (GNN for NLP), since vision and language is
a fast-growing research area and multi-modality data is widely used today.


Graph Neural Networks in Natural Language Processing

Bang Liu, Lingfei Wu


Abstract Natural language processing (NLP) and understanding aim to read
unformatted text in order to accomplish different tasks. While word embeddings
learned by deep neural networks are widely used, the underlying linguistic and
semantic structures of text pieces cannot be fully exploited in these
representations. Graphs are a natural way to capture the connections between
different text pieces, such as entities, sentences, and documents. To overcome
the limits of vector space models, researchers combine deep learning models
with graph-structured representations for various tasks in NLP and text mining.
Such combinations help to make full use of both the structural information in
text and the representation learning ability of deep neural networks. In this
chapter, we introduce the various graph representations that are extensively
used in NLP and show how different NLP tasks can be tackled from a graph
perspective. We summarize recent research on graph-based NLP and discuss in
detail two case studies, one on graph-based text clustering and matching and
one on multi-hop machine reading comprehension. Finally, we provide a synthesis
of the important open problems of this subfield.


21.1 Introduction 


Language serves as a cornerstone of human cognition. Enabling machines to
understand natural language is at the very heart of machine intelligence.
Natural language processing (NLP) is concerned with the interaction between
machines and human languages. It is a critical subfield of computer science,
linguistics, and artificial intelligence (AI). From the early research on
machine translation in the 1950s until today, NLP has played an essential role
in the research of machine learning and artificial intelligence.

NLP has a wide range of applications in the life and business of modern society.
Critical NLP applications include, but are not limited to: machine translation
applications that translate text or speech from a source language to a target
language (e.g., Google Translate, Yandex Translate); chatbots or virtual
assistants that conduct an on-line chat conversation with a human (e.g., Apple
Siri, Microsoft Cortana, Amazon Alexa); search engines for information
retrieval (e.g., Google, Baidu, Bing); question answering (QA) and machine
reading comprehension in different fields and applications (e.g., open-domain
question answering in search engines, medical question answering); knowledge
graphs and ontologies that extract and represent knowledge from multiple
sources to improve various applications (e.g., DBpedia (Bizer et al, 2009),
Google Knowledge Graph); and recommender systems in E-commerce based on text
analysis (e.g., E-commerce recommendation in Alibaba and Amazon). AI
breakthroughs in NLP are therefore of great value to business.

Two crucial research problems lie at the core of NLP: i) how to represent natural 
language texts in a format that computers can read; and ii) how to compute based 
on the input format to understand the input text pieces. We observe that researchers’ 
ideas on representing and modeling text keep evolving during the long history of 
NLP development. 

Up to the 1980s, most NLP systems were symbol-based. Different text pieces
were considered as symbols, and the models for various NLP tasks were
implemented as complex sets of hand-written rules. For example, classic
rule-based machine translation (RBMT) involves a host of rules defined by
linguists in grammar books. Such systems include Systran, Reverso, Prompt, and
LOGOS (Hutchins, 1995). Rule-based approaches with symbolic representations are
fast, accurate, and explainable. However, acquiring the rules for different
tasks is difficult and requires extensive expert effort.

Starting in the late 1980s, statistical machine learning algorithms brought a
revolution to NLP research. In statistical NLP systems, a piece of text is
usually considered as a bag of its words, disregarding grammar and even word
order but keeping multiplicity (Manning and Schutze, 1999). Many of the notable
early successes occurred in machine translation, owing to the development of
statistical models. Statistical systems were able to take advantage of
multilingual textual corpora. However, it is hard to model the semantic
structure and information of human language by simply considering the text as a
bag of words.
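The limitation of the bag-of-words view can be seen in a two-line experiment. The sketch below is only an illustration with made-up sentences and a naive whitespace tokenizer; it shows that multiplicity is kept while word order, and hence who did what to whom, is lost.

```python
from collections import Counter

def bag_of_words(text: str) -> Counter:
    """Map a text to word counts; order and grammar are discarded."""
    return Counter(text.lower().split())

a = bag_of_words("the dog bites the man")
b = bag_of_words("the man bites the dog")

same_bag = a == b  # the two sentences are indistinguishable as bags
```

Both sentences yield the same counts even though their meanings differ, which is precisely the structural information that the graph representations discussed later in this chapter aim to retain.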

Since the early 2010s, the field of NLP has shifted to neural networks and deep
learning, where word embedding techniques such as Word2Vec (Mikolov T, 2013)
or GloVe (Pennington et al, 2014) were developed to represent words as fixed
vectors. We have also witnessed an increase in end-to-end learning for tasks
such as question answering. Besides, by representing text as a sequence of word
embedding vectors, different neural network architectures, such as vanilla
recurrent neural networks (Pascanu et al, 2013), Long Short-Term Memory (LSTM)
networks (Greff et al, 2016), or convolutional neural networks (Dos Santos and
Gatti, 2014), were applied to model text. Deep learning has brought a new
revolution in NLP, greatly improving the performance of various tasks.

In 2018, Google introduced a neural network-based technique for NLP
pre-training called Bidirectional Encoder Representations from Transformers
(BERT) (Devlin et al, 2019). This model has enabled many NLP tasks to achieve
superhuman performance on different benchmarks and has spawned a series of
follow-up studies on pre-training large-scale language models (Qiu et al,
2020b). In such approaches, the representations of words are context-sensitive
vectors. By taking the contextual information into account, we can model the
polysemy of words. However, large-scale pre-trained language models consume
massive amounts of data and computing resources. Besides, existing neural
network-based models lack explainability or transparency, which can be a major
drawback in the health, education, and finance domains.

Along the evolving history of text representations and computational models,
from symbolic representations to context-sensitive embeddings, we can see
an increase of semantic and structural information in text modeling. A key
question is: how can we further improve the representation of various text
pieces and the computational models for different NLP tasks? We argue that
representing text as graphs and applying graph neural networks to NLP
applications is a highly promising research direction. Graphs are of great
significance to NLP research for several reasons, which we illustrate below.

First, our world consists of things and the relations between them. The ability to 
draw logical conclusions about how different things are related to one another, or 
so-called relational reasoning, is central to both human and machine intelligence. In 
NLP, understanding human language also requires modeling different text pieces 
and reasoning over their relations. Graphs provide a unified format to represent
things and the relations between them. By modeling text as graphs, we can char- 
acterize the syntactic and semantic structures of different texts and perform explain- 
able reasoning and inference over such representations. 

Second, the structure of languages is intrinsically compositional, hierarchical, 
and flexible. From corpus to documents, paragraphs, sentences, phrases, and words,
different text pieces form a hierarchical semantic structure, in which a higher-level 
semantic unit (e.g., a sentence) can be further decomposed into more fine-grained 
units (e.g., phrases and words). Such structural nature of human languages can be 
characterized by tree structures. Furthermore, due to the flexibility of languages, the 
same meaning can be expressed in different sentences, such as active and passive 
voices. However, we can unify the representation of varying sentences by seman- 
tic graphs like Abstract Meaning Representation (AMR) (Schneider et al, 2015) to 
make NLP models more robust. 

Last but not least, graphs have always been extensively utilized and have
formed an essential part of NLP applications, including syntax-based machine
translation, knowledge graph-based question answering, abstract meaning
representation for common sense reasoning tasks, and so on. In addition, with
the vigorous research on graph neural networks, the recent trend of combining
graph neural networks and NLP has become more and more prosperous. Moreover, by
utilizing the general representation ability of graphs, we can incorporate
multi-modal information (e.g., images or videos) into NLP, integrating
different signals, modeling world contexts and dynamics, and jointly learning
multiple tasks.

In this chapter, we present a brief overview of the status of graphs in NLP. In
Sec. 21.2, we introduce and categorize the different graph representations
adopted and show how NLP tasks can be mapped onto graph-based problems and
solved by graph neural network-based approaches. After that, we discuss two
case studies. The first case study in Sec. 21.3 introduces graph-based text
clustering and matching for hot events discovery and organization. The second
one in Sec. 21.4 presents graph-based multi-hop machine reading comprehension.
We then provide a synthesis of the important open problems of this subfield in
Sec. 21.5. Finally, we conclude this chapter in Sec. 21.6.

Concurrently, a few very recent surveys and tutorials (Wu et al, 2021c,b;
Vashishth et al, 2019) aim to comprehensively introduce the historical and
modern developments of machine learning (especially deep learning) on graphs
for NLP. In addition, the recently released Graph4NLP library¹ is the first
library at the intersection of Deep Learning on Graphs and Natural Language
Processing, and it is designed to be easy to use. It provides both full
implementations of state-of-the-art models for data scientists and flexible
interfaces for researchers and developers to build customized models, with
whole-pipeline support.


21.2 Modeling Text as Graphs 


In this section, we will provide an overview of different graph representations in 
NLP. After that, we will discuss how different NLP tasks can be tackled from a 
graph perspective. 


21.2.1 Graph Representations in Natural Language Processing 


Various graph representations have been proposed for text modeling. Based on the 
different types of graph nodes and edges, a majority of existing works can be gen- 
eralized into five categories: text graphs, syntactic graphs, semantic graphs, knowl- 
edge graphs, and hybrid graphs. 

Text graphs use words, sentences, paragraphs, or documents as nodes and
establish edges by word co-occurrence, location, or text similarities. Rousseau
and Vazirgiannis (2013); Rousseau et al (2015) represented a document as a
graph-of-words, where nodes represent unique terms and directed edges represent
co-occurrences between the terms within a fixed-size sliding window. Wang et al
(2011) connected terms with syntactic dependencies. Schenker et al (2003)
connected two words by a directed edge if one word immediately precedes the
other in the document title, body, or link. The edges are categorized by these
three types of linking. Balinsky et al (2011); Mihalcea and Tarau (2004);
Erkan and Radev (2004) connected sentences if they are near each other, share
at least one common keyword, or have a sentence similarity above a threshold.
Page et al (1999) connected web documents by hyperlinks. Putra and Tokunaga
(2017) constructed directed graphs of sentences for text coherence evaluation.
They utilized sentence similarities as weights and connected sentences under
various constraints on sentence similarity or location. Text graphs can be
established quickly, but they cannot characterize the syntactic or semantic
structure of sentences or documents.

¹ The Graph4NLP library can be accessed at https://github.com/graph4ai/graph4nlp.
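The graph-of-words construction described in the text graphs paragraph can be sketched as follows. This is a minimal illustration, not the cited authors' implementation: the window size, the whitespace tokenization, and the use of co-occurrence counts as edge weights are assumptions made for the example.

```python
from collections import defaultdict

def graph_of_words(tokens, window=3):
    """Directed co-occurrence graph: an edge (u, v) is added when term v
    follows term u inside a sliding window of `window` tokens."""
    edges = defaultdict(int)
    for i, u in enumerate(tokens):
        for v in tokens[i + 1 : i + window]:
            if u != v:                 # ignore self-loops
                edges[(u, v)] += 1     # weight = number of co-occurrences
    return dict(edges)

tokens = "graph neural networks model text as a graph".split()
g = graph_of_words(tokens, window=3)
nodes = {term for edge in g for term in edge}  # unique terms
```

Note that the repeated term "graph" maps to a single node, so the resulting structure captures which terms occur near each other but, as stated above, nothing about syntax or meaning.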

Syntactic graphs (or trees) emphasize the syntactical dependencies between 
words in a sentence. Such structural representations of sentences are achieved by 
parsing, which constructs the syntactic structure of a sentence according to a formal 
grammar. Constituency parsing tree and dependency parsing graph are two types 
of syntactic representations of sentences that use different grammars (Jurafsky, 
2000). Based on syntactic analysis, documents can also be structured. For
example, Leskovec et al (2004) extracted subject-predicate-object triples from
text based on syntactic analysis and merged them to form a directed graph. The
graph was further normalized by utilizing WordNet (Miller, 1995) to merge
triples belonging to the same semantic pattern.

While syntactic graphs show the grammatical structure of text pieces, seman- 
tic graphs aim to represent the meaning being conveyed. A model of semantics 
could help disambiguate the meaning of a sentence when multiple interpretations 
are valid. Abstract Meaning Representation (AMR) graphs (Banarescu et al, 2013) 
are rooted, labeled, directed, acyclic graphs (DAGs), comprising whole sentences. 
Sentences that are similar in meaning will be assigned the same AMR, even if they 
are not identically worded. In this way, AMR graphs abstract away from syntactic 
representations. The nodes in an AMR graph are AMR concepts, which are either
English words, PropBank framesets (Kingsbury and Palmer, 2002), or special
keywords. The edges are labeled with approximately 100 relations, including
frame arguments following PropBank conventions, semantic relations, quantities,
date-entities, lists, and so on.
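As a small illustration of the rooted, labeled, directed, acyclic structure, the sketch below encodes the AMR commonly given for the sentence "The boy wants to go" as a list of labeled edges. The example sentence is not taken from this chapter, and the data layout and helper names are assumptions chosen for brevity.

```python
# Variable -> concept (PropBank framesets carry a sense suffix such as -01).
concepts = {"w": "want-01", "b": "boy", "g": "go-01"}
# (source, relation, target); "b" is re-used, so the graph is not a tree.
edges = [("w", ":ARG0", "b"), ("w", ":ARG1", "g"), ("g", ":ARG0", "b")]

def roots(nodes, edges):
    """Nodes without incoming edges."""
    targets = {t for _, _, t in edges}
    return [n for n in nodes if n not in targets]

def is_acyclic(nodes, edges):
    """Kahn's algorithm: the graph is a DAG if every node can be removed."""
    indeg = {n: 0 for n in nodes}
    for _, _, t in edges:
        indeg[t] += 1
    queue = [n for n in nodes if indeg[n] == 0]
    removed = 0
    while queue:
        n = queue.pop()
        removed += 1
        for s, _, t in edges:
            if s == n:
                indeg[t] -= 1
                if indeg[t] == 0:
                    queue.append(t)
    return removed == len(indeg)

root_nodes = roots(concepts, edges)
dag = is_acyclic(concepts, edges)
boy_in_degree = sum(1 for _, _, t in edges if t == "b")
```

The node for "boy" has two incoming edges because the boy is both the one who wants and the one who goes; such re-entrancies are why AMRs are graphs rather than trees.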

Knowledge graphs (KGs) are graphs of data intended to accumulate and convey 
knowledge of the real world. The nodes of a KG represent entities of interest, and 
the edges represent relations between these entities (Hogan et al, 2020). Prominent 
examples of KGs include DBpedia (Bizer et al, 2009), Freebase (Bollacker et al, 
2007), Wikidata (Vrandečić and Krotzsch, 2014) and YAGO (Hoffart et al, 2011), 
covering various domains. KGs are broadly applied for commercial use-cases, such 
as web search in Bing (Shrivastava, 2017) and Google (Singhal, 2012), commerce 
recommendation in Airbnb (Chang, 2018) and Amazon (Krishnan, 2018), and social 
networks like Facebook (Noy et al, 2019) and LinkedIn (He et al, 2016b). There are 
also graph representations that connect terms in a document to real-world entities or 
concepts based on KGs such as DBpedia (Bizer et al, 2009) and WordNet (Miller, 
1995). For example, Hensman (2004) identifies the semantic roles in a sentence with
WordNet and VerbNet, and combines these semantic roles with a set of syntactic
rules to construct a concept graph.

Hybrid graphs contain multiple types of nodes and edges to integrate hetero- 
geneous information. In this way, the various text attributes and relations can be 
jointly utilized for NLP tasks. Rink et al (2010) utilized sentences as nodes
and encoded lexical, syntactic, and semantic relations in edges. Jiang et al
(2010) extracted tokens, syntactic structure nodes, semantic nodes, and so on
from each sentence and linked them by different types of edges. Baker and
Ellsworth (2017) built a sentence graph based on Frame Semantics and
Construction Grammar.


21.2.2 Tackling Natural Language Processing Tasks from a Graph Perspective


Understanding natural language is essentially understanding different textual ele- 
ments and their relationships. Therefore, we can tackle different NLP tasks from 
a graph perspective based on the different representations we have introduced. In 
recent years, many research works apply graph neural networks (Wu et al, 2021d) 
to solve NLP problems. A majority of them are actually solving the following prob- 
lems: node classification, link prediction, graph classification, graph matching, com- 
munity detection, graph-to-text generation, and reasoning over graphs. 

Tasks that focus on assigning labels to words or phrases can be modeled as
node classification. Cetoli et al (2017) showed that dependency trees play a
positive role in named entity recognition by using a graph convolutional
network (GCN) (Kipf and Welling, 2017b) to boost the results of a bidirectional
LSTM. In (Gui et al, 2019), a GNN-based approach was proposed to alleviate word
ambiguity in Chinese NER. Lexicons are used to construct the graph and provide
word-level features. Yao et al (2019) proposed a text classification method
termed Text Graph Convolutional Networks. It builds a heterogeneous
word-document graph for a whole corpus and turns document classification into
a node classification problem.
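The following sketch illustrates how document classification becomes node classification on a word-document graph. It uses the standard GCN propagation rule of Kipf and Welling, H' = ReLU(D^{-1/2}(A + I)D^{-1/2} H W), on a toy corpus. The binary document-word edges, the one-hot input features, and the random untrained weights are simplifying assumptions for the example and are not the exact recipe of the cited method.

```python
import numpy as np

def normalize_adjacency(adj: np.ndarray) -> np.ndarray:
    """Symmetric normalisation with self-loops: D^-1/2 (A + I) D^-1/2."""
    a_hat = adj + np.eye(adj.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))
    return a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def gcn_layer(a_norm: np.ndarray, h: np.ndarray, w: np.ndarray) -> np.ndarray:
    """One graph convolution followed by ReLU."""
    return np.maximum(a_norm @ h @ w, 0.0)

docs = [["graph", "neural"], ["neural", "text"]]
vocab = sorted({w for d in docs for w in d})
n_docs, n = len(docs), len(docs) + len(vocab)

# Nodes 0..n_docs-1 are documents, the remaining nodes are words.
adj = np.zeros((n, n))
for i, doc in enumerate(docs):
    for word in doc:
        j = n_docs + vocab.index(word)
        adj[i, j] = adj[j, i] = 1.0   # assumed binary document-word edge

a_norm = normalize_adjacency(adj)
rng = np.random.default_rng(0)
h0 = np.eye(n)                         # one-hot node features
w0 = rng.normal(size=(n, 4))           # untrained weights, for shape only
w1 = rng.normal(size=(4, 2))
logits = a_norm @ gcn_layer(a_norm, h0, w0) @ w1
doc_logits = logits[:n_docs]           # class scores for the document nodes
```

In a trained model, only the rows of `logits` that belong to labeled document nodes would enter the loss, while word nodes serve to pass information between documents that share vocabulary.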

In addition to node classification, predicting the relationships between two el- 
ements is also an essential problem in NLP research, especially for knowledge 
graphs. Zhang and Chen (2018b) proposed a novel link prediction framework to 
simultaneously learn from local enclosing subgraphs, embeddings, and attributes 
based on graph neural networks. Rossi et al (2021) presented an extensive com- 
parative analysis on link prediction models based on KG embeddings. They found 
that the structural features of the graph have a paramount effect on the effectiveness of link
prediction models. Guo et al (2019d) introduced the Attention Guided Graph Con- 
volutional Networks (AGGCNs) for relation extraction tasks. The model operates 
directly on the full dependency trees and learns to distill the useful information 
from them in an end-to-end fashion. 

Graph classification techniques are applied to text classification problems to
utilize the intrinsic structure of texts. In (Peng et al, 2018), a graph-CNN
based deep learning model was proposed for text classification. It first
converts texts to graphs-of-words and then utilizes graph convolution
operations to convolve the word graph. Huang et al (2019a); Zhang et al (2020d)
proposed graph-based methods for text classification, where each text owns its
structural graph and text-level word interactions can be learned.

For NLP tasks involving a pair of texts, graph matching techniques can be
applied to incorporate the structural information of each text. Liu et al
(2019a) proposed the Concept Interaction Graph to represent an article as a
graph of concepts. It then matches a pair of articles by comparing the
sentences that enclose the same concept node through a series of encoding
techniques and aggregates the matching signals through a graph convolutional
network. Haghighi et al (2005) represented sentences as directed graphs
extracted from a dependency parser and developed a learned graph matching
approach to approximate textual entailment. Xu et al (2019e) formulated the
KB-alignment task as a graph matching problem and proposed a graph
attention-based approach. It first matches all entities in two KGs, and then
jointly models the local matching information to derive a graph-level matching
vector.

Community detection provides a means of coarse-graining the complex interac- 
tions or relations between nodes, which is suitable for text clustering problems. For 
example, Liu et al (2017a, 2020a) described a news content organization system 
at Tencent which discovers events from vast streams of breaking news and evolves 
news story structures in an online fashion. They constructed a keyword graph and 
applied community detection over it to perform coarse-grained keyword-based text 
clustering. After that, they further constructed a document graph for each
coarse-grained cluster and applied community detection again to obtain
fine-grained event-level document clusters.

The task of graph-to-text generation aims at producing sentences that preserve 
the meaning of input graphs (Song et al, 2020b). Koncel-Kedziorski et al (2019) 
introduced a graph transforming encoder which can leverage the relational struc- 
ture of knowledge graphs and generate text from them. Wang et al (2020k); Song 
et al (2018) proposed graph-to-sequence models (Graph Transformer) to generate 
natural language texts from AMR graphs. Alon et al (2019a) leveraged the syntactic 
structure of programming languages to encode source code and generate text. 

Last but not least, reasoning over graphs plays a key role in multi-hop
question answering (QA), knowledge-based QA, and conversational QA tasks. Ding
et al (2019a) presented CogQA, a framework that tackles the multi-hop machine
reading problem at scale. The reasoning process is organized as a cognitive
graph, reaching entity-level explainability. Tu et al (2019) represented
documents as a heterogeneous graph and employed GNN-based message passing
algorithms to accumulate evidence on the proposed graph, in order to solve the
multi-hop reading comprehension problem across multiple documents. Fang et al
(2020) created a hierarchical graph by constructing nodes at different levels
of granularity (questions, paragraphs, sentences, entities), and proposed the
Hierarchical Graph Network (HGN) for multi-hop QA. Chen et al (2020n)
dynamically constructed a question and conversation history aware context
graph at each conversation turn and utilized a Recurrent Graph Neural Network
and a flow mechanism to capture the conversational flow in a dialog.

Fig. 21.1: The story tree of “2016 U.S. presidential election”. Figure credit: Liu et al
(2020a).


In the following, we will present two case studies to illustrate how graphs and 
graph neural networks can be applied to different NLP tasks with more details. 


21.3 Case Study 1: Graph-based Text Clustering and Matching 


In this case study, we will describe the Story Forest intelligent news organization 
system designed for fine-grained hot event discovery and organization from web- 
scale breaking news (Liu et al, 2017a, 2020a). Story Forest has been deployed in the 
Tencent QQ Browser, a mobile application that serves more than 110 million daily 
active users. Specifically, we will see how a number of graph representations are 
utilized for fine-grained document clustering and document pair matching and how 
GNN contributes to the system. 


21.3.1 Graph-based Clustering for Hot Events Discovery and Organization


In the fast-paced modern society, tremendous volumes of news articles are con- 
stantly being generated by different media providers, leading to information explo- 
sion. In the meantime, the large quantities of daily news stories that can cover differ- 
ent subjects and contain redundant or overlapping data are becoming increasingly 
difficult for readers to digest. Many news app users feel that they are overwhelmed 
by extremely repetitive information about a variety of current hot events while still 
struggling to get information about the events in which they are genuinely interested. 
Besides, search engines conduct document retrieval on the basis of user-entered re- 
quests. They do not, however, provide users with a natural way to view trending 
topics or breaking news.

Fig. 21.2: An overview of the system architecture of Story Forest. Figure credit: Liu
et al (2020a).


In (Liu et al, 2017a, 2020a), a novel news organization system named Story
Forest was proposed to address the aforementioned challenges. The key idea of
the Story Forest system is that, instead of providing users with a list of web
articles based on input queries, it introduces the concepts of “event” and
“story” and organizes the tremendous volume of news articles into story trees
in order to track evolving hot events, reveal the relationships between them,
and reduce redundancy. An event is a set of news articles reporting the same
piece of real-world breaking news. A story is a tree of related events that
report a series of evolving real-world breaking news.
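The two definitions translate naturally into a small tree data structure. The sketch below is illustrative only: the class name, its fields, and the toy events are hypothetical and do not reflect the data model of the deployed system.

```python
from dataclasses import dataclass, field

@dataclass
class Event:
    """A set of articles about the same piece of breaking news."""
    event_id: int
    title: str
    articles: list = field(default_factory=list)
    children: list = field(default_factory=list)  # events that develop from this one

def paths(node: Event) -> list:
    """All root-to-leaf paths of a story tree, as lists of event ids."""
    if not node.children:
        return [[node.event_id]]
    return [[node.event_id] + p for child in node.children for p in paths(child)]

# A toy story with one main path and one side branch.
announce = Event(1, "Candidate announces campaign", ["doc-a", "doc-b"])
debate = Event(2, "First television debate", ["doc-c"])
result = Event(3, "Election result", ["doc-d", "doc-e"])
health = Event(4, "Report on a candidate's health", ["doc-f"])
announce.children += [debate, health]
debate.children.append(result)

story_paths = paths(announce)
```

Each root-to-leaf path corresponds to one line of development within the story, which is how a reader can follow a single thread without wading through every related article.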

Figure 21.1 presents an example of a story tree, which showcases the story of
“2016 U.S. presidential election”. There are 20 nodes in the story tree. Each node
indicates an event in the U.S. election in 2016, and each edge represents a temporal
development relationship or a logical connection between two breaking news events.
For example, event 1 reports that Trump becomes a presidential candidate, and
event 20 reports that Donald Trump is elected president. The index number on
each node represents the event sequence over the timeline. The story tree
contains 6 paths, where the main path 1 → 20 captures the process of the
presidential election, the branch 3 → 6 describes Hillary’s health conditions,
the branch 7 → 13 focuses on the television debates, the branch 14 → 18 is
about the “mail door” investigation, etc. As we can see, by modeling the
evolutionary and logical structure of a story as a story tree, users can easily
understand the logic of news reports and learn the key facts quickly.

The story trees are constructed from web-scale news articles by the Story Forest
system. The system’s architecture is shown in Fig. 21.2. It consists primarily of four
components: preprocessing, keyword graph construction, clustering documents to
events, and growing story trees with events. The overall process is split into eight
stages. First, a range of NLP and machine learning tools are used to process the
input news document stream, including document filtering and word segmentation.
Then the system extracts keywords, constructs or updates the co-occurrence graph
of keywords, and divides the graph into sub-graphs. After that, it utilizes EventX, a
graph-based fine-grained clustering algorithm, to cluster documents into fine-grained
events. Finally, the previously formed story trees are updated by either inserting
each discovered event into an existing story tree at the right place or creating a new
story tree if the event does not belong to any current story.

We can observe from Fig. 21.2 that a variety of text graphs are utilized in 
the Story Forest system. Specifically, the EventX clustering algorithm is based on 
two types of text graphs: a keyword co-occurrence graph and a document relationship 
graph. The keyword co-occurrence graph connects two keywords if they co-occur 
more than n times in a news corpus, where n is a hyperparameter. On 
the other hand, the document relationship graph connects document pairs based on 
whether two documents are talking about the same event. Based on such two types 
of text graphs, EventX can accurately extract fine-grained document clusters, where 
each cluster contains a set of documents that focus on the same event. 

In particular, EventX performs two-layer graph-based clustering to extract events. 
The first layer performs community detection over the constructed keyword 
co-occurrence graph to split it into sub-graphs, where each sub-graph contains the keywords for 
a specific topic. The intuition for this step is that keywords related to a common topic 
frequently appear together in documents belonging to that topic. For example, 
documents belonging to the topic “2016 U.S. presidential election” will often men- 
tion keywords such as “Donald Trump”, “Hillary Clinton”, “election”, and so on. 
Therefore, highly correlated keywords will be linked to each other and form dense 
subgraphs, whereas keywords that are not highly related will have sparse or no links. 
The goal here is to extract dense keyword subgraphs linked to various topics. After 
obtaining the keyword subgraphs (or communities), we can assign each document 
to its most correlated keyword subgraph by calculating their TF-IDF similarity. At 
this point, we have grouped documents by topics in the first layer clustering. 
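
The first-layer clustering described above can be illustrated with a small, self-contained Python sketch. It is not the EventX implementation: co-occurrence is counted at the document level, connected components stand in for the community detection algorithm, and the helper names (`cooccurrence_graph`, `connected_components`, `assign_topics`) and the toy keyword lists are assumptions made for illustration only.

```python
import math
from collections import Counter, defaultdict
from itertools import combinations


def cooccurrence_graph(docs, n):
    """Link two keywords if they co-occur in more than n documents."""
    pair_counts = Counter()
    for keywords in docs:
        for a, b in combinations(sorted(set(keywords)), 2):
            pair_counts[(a, b)] += 1
    graph = defaultdict(set)
    for (a, b), count in pair_counts.items():
        if count > n:
            graph[a].add(b)
            graph[b].add(a)
    return dict(graph)


def connected_components(graph):
    """Simplified stand-in for community detection (assumption)."""
    seen, components = set(), []
    for start in sorted(graph):
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node not in component:
                component.add(node)
                stack.extend(graph[node] - component)
        seen |= component
        components.append(component)
    return components


def cosine(u, v):
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def assign_topics(docs, topics):
    """Assign each document to the keyword community with the highest TF-IDF cosine."""
    doc_freq = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((1 + len(docs)) / (1 + df)) + 1.0 for t, df in doc_freq.items()}

    def tfidf(tokens):
        return {t: c * idf.get(t, 0.0) for t, c in Counter(tokens).items()}

    topic_vectors = [tfidf(topic) for topic in topics]
    return [
        max(range(len(topics)), key=lambda i: cosine(tfidf(doc), topic_vectors[i]))
        for doc in docs
    ]


# Toy corpus: each document is already reduced to its keywords (hypothetical data).
docs = [
    ["trump", "clinton", "election"],
    ["trump", "election", "debate"],
    ["clinton", "election", "trump"],
    ["earthquake", "rescue", "tsunami"],
    ["earthquake", "tsunami"],
    ["tsunami", "earthquake", "rescue"],
]
graph = cooccurrence_graph(docs, n=1)
topics = connected_components(graph)
labels = assign_topics(docs, topics)
print(topics, labels)
```

In this toy run the keyword "debate" never passes the co-occurrence threshold, so it does not join any keyword community, yet the document containing it is still assigned to the election topic through its other keywords.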

In the second layer, EventX constructs a document relationship graph for each 
topic obtained in the first layer. Specifically, a binary classifier will be applied to 
each pair of documents in a topic to detect whether two documents are talking about 
the same event. If yes, we connect the pair of documents. In this way, the set of 
documents in a topic turns into a document relationship graph. After that, the same 
community detection algorithm as in the first layer is applied to the document 
relationship graph, splitting it into sub-graphs where each sub-graph now represents 
a fine-grained event instead of a coarse-grained topic. Since the number of news 
articles belonging to each topic is significantly smaller after the first-layer document 
clustering, the graph-based clustering on the second layer is highly efficient, making 
it applicable for real-world applications. After extracting fine-grained events, we can 
update the story trees by inserting an event into its related story or creating a new story 
tree if it does not belong to any existing story. We refer readers to (Liu et al, 2020a) for more 
details about the Story Forest system. 
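
As a rough illustration of the second layer, the sketch below builds a document relationship graph inside one topic and splits it into events. The pairwise classifier in Story Forest is a trained binary model; here a Jaccard-overlap rule (`same_event`) is used purely as a hypothetical stand-in, and connected components again replace the community detection step.

```python
from itertools import combinations


def same_event(doc_a, doc_b, threshold=0.5):
    """Hypothetical stand-in for the trained binary classifier: Jaccard overlap."""
    a, b = set(doc_a), set(doc_b)
    return len(a & b) / len(a | b) >= threshold


def extract_events(topic_docs, classifier=same_event):
    """Connect document pairs judged to share an event, then split the graph."""
    neighbors = {i: set() for i in range(len(topic_docs))}
    for i, j in combinations(range(len(topic_docs)), 2):
        if classifier(topic_docs[i], topic_docs[j]):
            neighbors[i].add(j)
            neighbors[j].add(i)

    events, seen = [], set()
    for start in neighbors:
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node not in component:
                component.add(node)
                stack.extend(neighbors[node] - component)
        seen |= component
        events.append(sorted(component))
    return events


# Toy documents from a single topic, reduced to keywords (hypothetical data).
topic_docs = [
    ["trump", "clinton", "debate", "television"],
    ["debate", "television", "clinton", "trump", "moderator"],
    ["clinton", "email", "investigation", "fbi"],
    ["fbi", "email", "investigation", "clinton", "server"],
]
events = extract_events(topic_docs)
print(events)
```

Because the pairwise step is quadratic in the number of documents, it is only practical once the first layer has reduced each topic to a small document set, which matches the efficiency argument made above.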


21.3.2 Long Document Matching with Graph Decomposition and 
Convolution 


During the construction of the document relationship graph in the Story Forest sys- 
tem, a fundamental problem is determining whether two news articles are talking 
about the same event. This is a semantic matching problem, which lies 
at the core of many NLP applications, including search engines, 
recommender systems, news systems, etc. However, previous research on 
semantic matching is mainly designed for matching sentence pairs (Wan et al, 2016; 
Pang et al, 2016), e.g., for paraphrase identification, answer selection in question- 
answering, and so on. Due to the long length of news articles, such methods are not 
suitable and do not perform well on document matching (Liu et al, 2019a). 

To solve this challenge, Liu et al (2019a) presented a divide-and-conquer strategy 
to align a pair of documents and shift deep text comprehension away from the cur- 
rently dominant sequential modeling of language elements and toward a new level 
of graphical document representation that is better suited to longer articles. Specif- 
ically, Liu et al (2019a) proposed the Concept Interaction Graph (CIG) as a way to 
view a document as a weighted graph of concepts, with each concept node being 
either a keyword or a group of closely related keywords. Furthermore, two con- 
cept nodes will be connected by a weighted edge which indicates their interaction 
strength. 

As a toy example, Fig. 21.3 shows how to convert a document into a Concept In- 
teraction Graph (CIG). First, we extract keywords such as Rick, Morty, and Summer 
from the document using standard keyword extraction algorithms, e.g., TextRank 
(Mihalcea and Tarau, 2004). Second, similar to what we have done in the Story For- 
est system, we can group keywords into sub-graphs by community detection. Each 
keyword community turns into a “concept” in the document. After extracting con- 
cepts, we attach each sentence in the document to its most related concept node by 
calculating the similarities between a sentence and each concept. In Fig. 21.3, 
sentences 5 and 6 are mainly talking about the relationship between Rick and Summer, 
and are thus attached to the concept (Rick, Summer). Similarly, we can attach other 
sentences to nodes, decomposing the content of a document into a number of concepts. 
To construct edges, we represent each node’s sentence set as a concatenation 
of the sentences attached to it and measure the edge weight between any two nodes 
as the TF-IDF similarity between their sentence sets to create edges that show the 
correlation between different concepts. An edge will be removed if its weight is below 
a threshold. For a pair of documents, the process of converting them into a CIG 
is similar. The only differences are that the keywords are from both documents, and 
each concept node will have two sets of sentences from the two documents. As a result, 
we have represented the original document (or document pair) with a graph of 
key concepts, each with a (or a pair of) sentence subset(s), as well as the interaction 
topology among them. 

Fig. 21.4: An overview of our approach for constructing the Concept Interaction 
Graph (CIG) from a pair of documents and classifying it by Graph Convolutional 
Networks. Figure credit: Liu et al (2019a). 
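
The construction of a CIG for a single document can be sketched as follows. This is a simplified illustration rather than the implementation of Liu et al (2019a): the concepts are given by hand instead of being obtained by keyword extraction and community detection, the TF-IDF weighting is a basic smoothed variant, and the function names, the toy sentences, and the threshold value are all assumptions.

```python
import math
from collections import Counter


def tfidf_vectors(token_lists):
    """Smoothed TF-IDF vectors for a list of token lists."""
    n = len(token_lists)
    doc_freq = Counter(t for tokens in token_lists for t in set(tokens))
    idf = {t: math.log((1 + n) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    return [{t: c * idf[t] for t, c in Counter(tokens).items()} for tokens in token_lists]


def cosine(u, v):
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def build_cig(sentences, concepts, threshold=0.1):
    """Attach sentences to concepts, then weight concept-concept edges."""
    vectors = tfidf_vectors(sentences + [sorted(c) for c in concepts])
    sentence_vecs = vectors[:len(sentences)]
    concept_vecs = vectors[len(sentences):]

    # Step 1: attach each sentence to its most similar concept node.
    attached = {i: [] for i in range(len(concepts))}
    for s, vec in enumerate(sentence_vecs):
        best = max(range(len(concepts)), key=lambda i: cosine(vec, concept_vecs[i]))
        attached[best].append(s)

    # Step 2: a node's text is the concatenation of its attached sentences.
    node_tokens = [[t for s in attached[i] for t in sentences[s]] for i in attached]
    node_vecs = tfidf_vectors(node_tokens)

    # Step 3: edge weight = TF-IDF similarity; drop edges below the threshold.
    edges = {}
    for i in range(len(concepts)):
        for j in range(i + 1, len(concepts)):
            weight = cosine(node_vecs[i], node_vecs[j])
            if weight >= threshold:
                edges[(i, j)] = weight
    return attached, edges


# Toy document (hypothetical sentences, already tokenized).
sentences = [
    ["rick", "asks", "morty", "to", "travel"],
    ["morty", "follows", "rick", "reluctantly"],
    ["rick", "argues", "with", "summer"],
    ["summer", "forgives", "rick"],
]
concepts = [{"rick", "morty"}, {"rick", "summer"}]
attached, edges = build_cig(sentences, concepts, threshold=0.1)
print(attached, edges)
```

For a document pair, the same procedure would keep two sentence lists per node, one for each document, instead of a single list.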

The CIG representation of a document pair decomposes its content into multi- 
ple parts. Next, we need to match the two documents based on such representation. 
Fig. 21.4 illustrates the process of matching a pair of long documents. The matching 
process consists of four steps: a) preprocessing the input document pair and 
transforming it into a CIG; b) matching the sentences from two documents over each node 
to get local matching features; c) structurally transforming local matching features 
by graph convolutional layers; and d) aggregating all the local matching features to 
get the final result. 

Specifically, for the local matching on each concept node, the inputs are the two 
sets of sentences from two documents. As each node only contains a small portion 
of the document sentences, the long text matching problem is transformed into short 
text matching on a number of concept nodes. In (Liu et al, 2019a), two different 
matching models are utilized: i) similarity-based matching, which calculates a variety 
of text similarities between the two sets of sentences; ii) Siamese matching, which 
utilizes a Siamese neural network (Mueller and Thyagarajan, 2016) to encode the 
two sentence sets and get a local matching vector. After getting local matching results, 
the next question is: how to get an overall matching score? Liu et al (2019a) 
aggregate the local matching vectors into a final matching score for the pair of articles 
by utilizing the ability of the graph convolutional network filters (Kipf and 
Welling, 2017b) to capture the patterns exhibited in the CIG at multiple scales. In 
particular, the local matching vectors of the concept nodes are transformed by multiple 
GCN layers to take the interaction structure between nodes (or concepts in two 
documents) into consideration. After getting the transformed feature vectors, they 
are aggregated by mean pooling to get a global matching vector. Finally, the global 
matching vector will be fed into a classifier (e.g., a feed-forward neural network) to 
get the final matching label or score. The local matching module, global aggregation 
module, and the final classification module are trained end-to-end. 
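
The aggregation pipeline (GCN layers over local matching vectors, mean pooling, then a classifier) can be sketched in plain Python as below. This is only an illustration under several assumptions: the propagation rule is the symmetric-normalized form with self-loops from Kipf and Welling, the weights are fixed by hand rather than trained end-to-end, a single logistic unit stands in for the feed-forward classifier, and all names and numbers are hypothetical.

```python
import math


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def normalized_adjacency(weights):
    """Symmetric normalization with self-loops: D^{-1/2} (A + I) D^{-1/2}."""
    n = len(weights)
    a_hat = [[weights[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    degree = [sum(row) for row in a_hat]
    return [[a_hat[i][j] / math.sqrt(degree[i] * degree[j]) for j in range(n)] for i in range(n)]


def gcn_layer(a_norm, features, weight):
    """One layer: ReLU(A_norm @ features @ weight)."""
    z = matmul(matmul(a_norm, features), weight)
    return [[max(0.0, value) for value in row] for row in z]


def mean_pool(features):
    return [sum(column) / len(features) for column in zip(*features)]


def match_score(edge_weights, local_vectors, layer_weights, clf_weights, clf_bias):
    """Transform local matching vectors with GCN layers, pool, and classify."""
    a_norm = normalized_adjacency(edge_weights)
    hidden = local_vectors
    for weight in layer_weights:
        hidden = gcn_layer(a_norm, hidden, weight)
    pooled = mean_pool(hidden)  # global matching vector
    logit = sum(w * p for w, p in zip(clf_weights, pooled)) + clf_bias
    return pooled, 1.0 / (1.0 + math.exp(-logit))


# Toy CIG with three concept nodes and 2-dimensional local matching vectors.
edge_weights = [[0.0, 0.2, 0.0], [0.2, 0.0, 0.5], [0.0, 0.5, 0.0]]
local_vectors = [[0.9, 0.1], [0.8, 0.3], [0.2, 0.7]]
layer_weights = [[[1.0, -0.5], [0.5, 1.0]], [[1.0, 0.0], [0.0, 1.0]]]
pooled, score = match_score(edge_weights, local_vectors, layer_weights, [1.0, -1.0], 0.0)
print(pooled, score)
```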

In (Liu et al, 2019a), extensive evaluations were performed to test the perfor- 
mance of the proposed approach for document matching. A key discovery made 
by (Liu et al, 2019a) is that the graph convolution operation significantly improves 
the performance of matching, demonstrating the effectiveness of applying graph neural 
networks to the proposed text graph representation. The structural transformation 
on the matching vectors via GCN can efficiently capture the semantic interactions 
between sentences, and the transformed matching vectors better capture the seman- 
tic distance over each concept node by integrating the information of its neighbor 
nodes. 


21.4 Case Study 2: Graph-based Multi-Hop Reading 
Comprehension 


In this case study, we further introduce how graph neural networks can be applied to 
machine reading comprehension in NLP. Machine reading comprehension (MRC) 
aims to teach machines to read and understand unstructured text like a human. It is a 
challenging task in artificial intelligence and has great potential in various enterprise 
applications. We will see that by representing text as a graph and applying graph 
neural networks to it, we can mimic the reasoning process of human beings and 
achieve significant improvements for MRC tasks. 

Suppose we have access to a Wikipedia search engine, which can be utilized 
to retrieve the introductory paragraph para[x] of an entity x. How can we answer 
the question “Who is the director of the 2003 film which has scenes in it filmed 
at the Quality Cafe in Los Angeles?” with the search engine? Naturally, we will 
start by paying attention to related entities such as “Quality Cafe”, look up relevant 
introductions through Wikipedia, and quickly locate “Old School” and “Gone in 
60 Seconds” when it comes to Hollywood movies. By continuing to inquire about 
the introductions of the two movies, we further find their directors. The last step 
is to determine which director it is. This requires us to analyze the semantics and 
qualifiers of the sentence. After learning that the movie was released in 2003, we can make the 
final judgment: “Todd Phillips” is the answer we want. Figure 21.5 illustrates this 
process. Answering the aforementioned question requires multi-hop reasoning over 
different pieces of information, which is known as multi-hop question answering. 

In fact, “pay attention to related entities quickly” and “analyze the meaning of 
sentences for inference” are two different thinking processes. In cognitive science, the 
well-known “dual process theory” (Kahneman, 2011) holds that human cognition is 
divided into two systems. System 1 is an implicit, unconscious and intuitive thinking 
system. Its operation relies on experience and association. System 2 performs 
an explicit, conscious and controllable reasoning process. This system uses knowledge 
in working memory to perform slow but reliable logical reasoning. System 2 is the 
embodiment of human advanced intelligence. 

Guided by the dual process theory, the Cognitive Graph QA (CogQA) framework 
was proposed in (Ding et al, 2019a). It adopts a directed graph structure, named 
cognitive graph, to perform step-by-step deduction and exploration in the cognitive 
process of multi-hop question answering. Figure 21.5 presents the cognitive graph 
for answering the previously mentioned question. Denote the graph as $\mathcal{G}$; each node 
in $\mathcal{G}$ represents an entity or possible answer x, also interchangeably denoted as node 
x. The solid black edges are the correct reasoning path to answer the question. The 
cognitive graph is constructed by an extraction module that acts like System 1. It 
takes the introductory paragraph para[x] of entity x as input, and outputs answer 
candidates (i.e., ans nodes) and useful next-hop entities (i.e., hop nodes) from the 
paragraph. These new nodes gradually expand $\mathcal{G}$, forming an explicit graph structure 
for the System 2 reasoning module. During the expansion of $\mathcal{G}$, new nodes or 
existing nodes with new incoming edges bring new clues about the answer. Such 
nodes are referred to as frontier nodes. A clue is a form-flexible concept, referring 
to information from predecessors that guides System 1 to better extract spans. 
To perform neural network-based rather than rule-based reasoning over $\mathcal{G}$, System 1 
also summarizes para[x] into an initial hidden representation vector when extracting 
spans, and System 2 updates all paragraphs’ hidden vectors X based on the graph 
structure as reasoning results for downstream prediction. 

The procedure of the CogQA framework is as follows. First, the cognitive graph 
$\mathcal{G}$ is initialized with the entities mentioned in the input question Q, and the entities 
are marked as initial frontier nodes. After initialization, a node x is popped from 
the frontier nodes, and then a two-stage iterative process is conducted with two models 
$\mathcal{S}_1$ and $\mathcal{S}_2$ mimicking System 1 and System 2, respectively. In the first stage, the 
System 1 module in CogQA extracts question-relevant entities and answer candidates 
from paragraphs, and encodes their semantic information. Extracted entities are organized 
as a cognitive graph, which resembles the working memory. Specifically, 
given x, CogQA collects clues[x, $\mathcal{G}$] from predecessor nodes of x, where the clues 
can be sentences where x is mentioned. It further fetches the introductory paragraph 
para[x] from the Wikipedia database $\mathcal{W}$ if any. After that, $\mathcal{S}_1$ generates sem[x, Q, clues], 
which is the initial $X_x$ (i.e., the embedding of x). If x is a hop node, then $\mathcal{S}_1$ finds hop 
(e.g., entities) and answer spans in para[x]. For each hop span y, if $y \notin \mathcal{G}$ and $y \in \mathcal{W}$, 
then create a new hop node for y and add it to $\mathcal{G}$. If $y \in \mathcal{G}$ but edge$(x, y) \notin \mathcal{G}$, then 
add a new edge (x, y) to $\mathcal{G}$ and mark node y as a frontier node, as it needs to be 
revisited with new information. For each answer span y, a new answer node y and 
edge (x, y) will be added to $\mathcal{G}$. In the second stage, System 2 conducts the reasoning 
procedure over the graph and collects clues to guide System 1 to better extract 
next-hop entities. In particular, the hidden representation X of all paragraphs will be 
updated by $\mathcal{S}_2$. The above process is iterated until there is no frontier node in the 
cognitive graph (i.e., all possible answers are found) or the graph is large enough. 
Then the final answer is chosen with a predictor $\mathcal{F}$ based on the reasoning results 
X from System 2. 
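
The control flow of this procedure can be sketched in Python as follows. This is a toy skeleton rather than the CogQA implementation: System 1, System 2, and the predictor are passed in as plain functions, the "Wikipedia database" is a small hand-written dictionary, predecessor records stand in for clue sentences, and every name and data record here (`cogqa`, `toy_system1`, `toy_system2`, the `wiki` entries) is an assumption made for illustration.

```python
from collections import deque


def cogqa(question_entities, wiki, system1, system2, predictor, max_nodes=50):
    nodes = {entity: "hop" for entity in question_entities}  # node -> "hop" or "ans"
    edges = set()                                            # directed (predecessor, node)
    hidden = {}                                              # node -> representation
    frontier = deque(question_entities)

    def mark_frontier(node):
        if node not in frontier:
            frontier.append(node)

    # Iterate until no frontier node is left or the graph is large enough.
    while frontier and len(nodes) < max_nodes:
        x = frontier.popleft()
        # Clues come from predecessors of x (here: their toy records).
        clues = [wiki[p] for (p, q) in sorted(edges) if q == x and p in wiki]
        paragraph = wiki.get(x)
        hidden[x], hop_spans, answer_spans = system1(x, clues, paragraph)
        if nodes[x] == "hop":
            for y in hop_spans:
                if y not in nodes and y in wiki:
                    nodes[y] = "hop"          # new hop node
                    edges.add((x, y))
                    mark_frontier(y)
                elif y in nodes and (x, y) not in edges:
                    edges.add((x, y))         # new edge: revisit y with new clues
                    mark_frontier(y)
            for y in answer_spans:
                nodes.setdefault(y, "ans")    # new answer node
                if (x, y) not in edges:
                    edges.add((x, y))
                    mark_frontier(y)
        hidden = system2(nodes, edges, hidden)  # System 2 updates all representations
    answers = [node for node, kind in nodes.items() if kind == "ans"]
    return max(answers, key=lambda node: predictor(hidden.get(node, 0.0)), default=None)


# Toy stand-ins (hypothetical): a tiny "Wikipedia" and rule-based systems.
wiki = {
    "Quality Cafe": {"hops": ["Old School", "Gone in 60 Seconds"], "answers": []},
    "Old School": {"year": 2003, "hops": [], "answers": ["Todd Phillips"]},
    "Gone in 60 Seconds": {"year": 2000, "hops": [], "answers": ["Dominic Sena"]},
}


def toy_system1(x, clues, paragraph):
    # "Semantic vector" reduced to one number: does a clue mention a 2003 film?
    sem = 1.0 if any(clue.get("year") == 2003 for clue in clues) else 0.0
    if paragraph is None:
        return sem, [], []
    return sem, paragraph["hops"], paragraph["answers"]


def toy_system2(nodes, edges, hidden):
    return hidden  # identity stand-in for the GNN update


answer = cogqa(["Quality Cafe"], wiki, toy_system1, toy_system2, predictor=lambda h: h)
print(answer)
```

In the real framework the three stand-ins would be replaced by learned components; the skeleton only shows how frontier nodes drive the expansion of the graph.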

The CogQA framework can be implemented as the system in Fig. 21.6. It utilizes 
BERT (Devlin et al, 2019) as System 1 and a GNN as System 2. The clues clues[x, $\mathcal{G}$] 
are the sentences in the paragraphs of x’s predecessor nodes from which x is extracted. 
We can observe from Fig. 21.6 that the input to BERT is the concatenation 
of the question, the clues passed from predecessor nodes, and the introductory paragraph 
of x. Based on these inputs, BERT outputs hop spans and answer spans, and 
uses the output at position 0 as sem[x, Q, clues]. 

For System 2, CogQA utilizes a variant of GNN to update the hidden representations 
of all nodes. For each node x, its initial representation $X_x \in \mathbb{R}^{H}$ is the semantic 
vector sem[x, Q, clues] from System 1 (i.e., BERT). The update formulas of the 
GNN layers are as follows: 

$$\Delta = \sigma\big((AD^{-1})^{\top}\,\sigma(XW_1)\big) \tag{21.1}$$
$$X' = \sigma(XW_2 + \Delta) \tag{21.2}$$

where $X'$ is the new hidden representation after a propagation step of the GNN, 
$W_1, W_2 \in \mathbb{R}^{H \times H}$ are weight matrices, and $\sigma$ is the activation function. 
$\Delta \in \mathbb{R}^{n \times H}$ contains the aggregated vectors passed from neighbors in the propagation. 
A is the adjacency matrix of $\mathcal{G}$. It is column-normalized to $AD^{-1}$, where D is the 
degree matrix of $\mathcal{G}$. By left multiplying the transformed hidden vectors $\sigma(XW_1)$ 
with $(AD^{-1})^{\top}$, the GNN performs a localized spectral filtering. In the iterative 
step of visiting frontier node x, its hidden representation $X_x$ is updated following 
the above equations. 
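
A small numerical sketch of one propagation step may help make the update concrete. The code below is an illustration under stated assumptions: ReLU is used as the activation, the weight matrices are identity matrices rather than learned parameters, D is taken to be the diagonal matrix of column sums of A, and columns with zero degree are left as zero vectors to avoid division by zero.

```python
def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def relu(matrix):
    return [[max(0.0, value) for value in row] for row in matrix]


def gnn_step(adjacency, x, w1, w2):
    """One step: delta = act((A D^-1)^T act(X W1)); X' = act(X W2 + delta)."""
    n = len(adjacency)
    col_sums = [sum(adjacency[i][j] for i in range(n)) for j in range(n)]
    # Row j of (A D^-1)^T holds A[i][j] / degree[j], i.e. an average over predecessors.
    norm_t = [
        [adjacency[i][j] / col_sums[j] if col_sums[j] else 0.0 for i in range(n)]
        for j in range(n)
    ]
    delta = relu(matmul(norm_t, relu(matmul(x, w1))))
    xw2 = matmul(x, w2)
    summed = [[a + d for a, d in zip(row_x, row_d)] for row_x, row_d in zip(xw2, delta)]
    return relu(summed)


# Toy graph: nodes 0 and 1 both point to node 2 (adjacency[i][j] = 1 for edge i -> j).
adjacency = [[0, 0, 1], [0, 0, 1], [0, 0, 0]]
x = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
x_new = gnn_step(adjacency, x, identity, identity)
print(x_new)
```

With these toy values, node 2 starts from a zero vector and receives the average of its two predecessors, while nodes 0 and 1 have no incoming edges and keep their own representations.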

Finally, a two-layer fully connected network (FCN) is utilized to serve as the predictor 
$\mathcal{F}$: 

$$\text{answer} = \operatorname*{arg\,max}_{\text{answer node } x} \mathcal{F}(X_x) \tag{21.3}$$

In this way, one answer candidate can be selected as the final answer. In the HotpotQA 
dataset (Yang et al, 2018b), there are also questions that aim to compare a 
certain property of entities x and y. Such questions are regarded as binary classification 
with input $X_x - X_y$ and solved by another FCN with an identical architecture. 

The cognitive graph structure in the CogQA framework offers ordered and entity-level 
explainability and is well suited to relational reasoning, owing to the explicit reasoning 
paths in it. Aside from simple paths, it can also clearly display joint or loopy reason- 
ing processes, where new predecessors might bring new clues about the answer. As 
we can see, by modeling the context information as a cognitive graph and applying 
GNN to such representation, we can mimic the dual process of human perception 
and reasoning and achieve excellent performance on multi-hop machine reading 
comprehension tasks, as demonstrated in (Ding et al, 2019a). 


21.5 Future Directions 


Applying graph neural networks to NLP tasks with suitable graph representations 
for text can bring significant benefits, as we have discussed and shown through 
the case studies. Although GNNs have achieved outstanding performance in many 
tasks, including text clustering, classification, generation, machine reading compre- 
hension and so on, there are still numerous open problems to solve at the moment 
to better understand human language with graph-based representations and models. 
In particular, here we categorize and discuss the open problems or future directions 
for graph-based NLP in terms of five aspects: model design of GNNs, data rep- 
resentation learning, multi-task relationship modeling, world model, and learning 
paradigm. 

Although several GNN models are applicable to NLP tasks, only a small subset of 
them is explored for model design. More advanced GNN models can be utilized or 
improved to handle the scale, depth, dynamics, heterogeneity, and explainability of 
natural language texts. First, scaling GNNs to large graphs helps to utilize resources 
such as large-scale knowledge graphs better. Second, most GNN architectures are 
shallow, and the performance drops after two to three layers. Designing deeper GNNs 
enables node representation learning with information from larger and more adaptive 
receptive fields (Liu et al, 2020c). Third, we can utilize dynamic graphs to model 
the evolving or temporal phenomena in texts, e.g., the development of stories or 
events. Correspondingly, dynamic or temporal GNNs (Skarding et al, 2020) can help 
capture the dynamic nature in specific NLP tasks. Fourth, the syntactic, semantic, as 
well as knowledge graphs in NLP are essentially heterogeneous graphs. Developing 
heterogeneous GNNs (Wang et al, 2019i; Zhang et al, 2019b) can help better utilize 
the various node and edge information in text and understand its semantics. 
Last but not least, the need for improved explainability, interpretability, and trust of 
AI systems in general demands principled methodologies. One way is using GNNs 
as a model of neural-symbolic computing and reasoning (Lamb et al, 2020), as the 
data structure and reasoning process can be naturally captured by graphs. 

For data representations, most existing GNNs can only learn from input when 
a graph-structure of input data is available. However, real-world graphs are often 
noisy and incomplete or might not be available at all. Designing effective models 
and algorithms to automatically learn the relational structure in input data with limited 
structured inductive biases can effectively address this problem. Instead of manually 
designing specific graph representations of data for different applications, we 
can enable models to automatically identify the implicit, high-order, or even causal 
relationships between input data points, and learn the graph structure and representations 
of inputs. To achieve this, recent research on graph pooling (Lee et al, 
2019b), graph transformers (Yun et al, 2019), and hypergraph neural networks (Feng 
et al, 2019c) can be applied and further explored. 

Multi-task learning (MTL) in deep neural networks for NLP has recently re- 
ceived increasing interest as it has the potential to efficiently regularize models and 
to reduce the need for labeled data (Bingel and Søgaard, 2017). We can marry the 
representation power of graph structures with multi-task learning to integrate diverse 
input data, such as images, text pieces, and knowledge bases, and jointly learn a uni- 
fied and structured representation for various tasks. Furthermore, we can learn the 
relationships or correlations between different tasks and exploit the learned relation- 
ship for curriculum learning to accelerate the convergence rate for model training. 
Finally, with the unified graph representation and integration of different data, as 
well as the joint and curriculum learning of different tasks, NLP or AI systems will 
gain the ability to continually acquire, fine-tune, and transfer knowledge and skills 
throughout their lifespan. 

Grounded language learning or acquisition (Matuszek, 2018; Hermann et al, 
2017) is another trending research topic that aims at learning the meaning of lan- 
guage as it applies to the physical world. Intuitively, language can be better learned 
when presented and interpreted in the context of the world it pertains to. It has 
been demonstrated that GNNs can efficiently capture joint dependencies between 
different elements in the world (Li et al, 2017e). Besides, they can also efficiently 
utilize the rich information in multiple modalities of the world to help understand 
the meaning of scene texts (Gao et al, 2020a). Therefore, representing the world 
or environment with graphs and GNNs to improve the understanding of languages 
deserves more research endeavors. 

Lastly, research about self-supervised pre-training for GNNs is also attracting 
more attention. Self-supervised representation learning leverages input data itself 
as supervision and benefits almost all types of downstream tasks (Liu et al, 2020f). 
Numerous successful self-supervised pre-training strategies, such as BERT (Devlin 
et al, 2019) and GPT (Radford et al, 2018) have been developed to tackle a variety 
of language tasks. For graph learning, when task-specific labeled data is extremely 
scarce, or the graphs in the training set are structurally very different from graphs 
in the test set, pre-training GNNs can serve as an efficient approach for transfer 
learning on graph-structured data (Hu et al, 2020c). 


21.6 Conclusions 


Over the past few years, graph neural networks have become powerful and practical 
tools for a variety of problems that can be modeled by graphs. In this chapter, we 
presented a comprehensive overview of combining graph representations and graph neural 
networks in NLP tasks. We introduced the motivation for applying graph representations 
and GNNs to NLP problems through the history of NLP research. 
After that, we provided a brief overview of various graph representations in NLP, 
and discussed how to tackle different NLP tasks from a graph perspective. To 
illustrate in more detail how graphs and GNNs are applied in NLP applications, 
we presented two case studies related to graph-based hot event discovery and multi-hop 
machine reading comprehension. Finally, we categorized and discussed several 
frontier research directions and open problems for graph-based NLP. 


Editor’s Notes: Graph-based methods for Natural Language Processing 
have been studied extensively over the last two decades. Indeed, human language 
consists of high-level symbols, and thus there is rich hidden structural information 
beyond the original simple text sequence. In order to make full use 
of GNNs for NLP, graph structure learning techniques in Chapter 14 and 
GNN Methods in Chapter 4 serve as the two fundamental building blocks. 
Meanwhile, GNN Scalability in Chapter 6, Heterogeneous GNNs in Chapter 16, 
GNN Robustness in Chapter 8, and so on are also highly important 
for developing an effective and efficient approach with GNNs for various 
NLP applications. This chapter is also highly correlated with Chapter 
20 (GNN for CV) since vision and language is a fast-growing research area 
and multi-modality data is widely used today. 


Chapter 22 
Graph Neural Networks in Program Analysis 


Miltiadis Allamanis 


Abstract Program analysis aims to determine if a program’s behavior complies 
with some specification. Commonly, program analyses need to be defined and tuned 
by humans. This is a costly process. Recently, machine learning methods have 
shown promise for probabilistically realizing a wide range of program analyses. 
Given the structured nature of programs, and the commonality of graph represen- 
tations in program analysis, graph neural networks (GNN) offer an elegant way to 
represent, learn, and reason about programs and are commonly used in machine 
learning-based program analyses. This chapter discusses the use of GNNs for pro- 
gram analysis, highlighting two practical use cases: variable misuse detection and 
type inference. 


22.1 Introduction 


Program analysis is a widely studied area in programming language research that 
has been an active and lively research domain for decades with many fruitful results. 
The goal of program analysis is to determine properties of a program with 
regard to its behavior (Nielson et al, 2015). Traditionally, analysis methods aim to 
provide formal guarantees about some program property, e.g., that the output of a 
function always satisfies some condition, or that a program will always terminate. 
To provide those guarantees, traditional program analysis relies on rigorous math- 
ematical methods that can deterministically and conclusively prove or disprove a 
formal statement about a program’s behavior. 

However, these methods cannot learn to employ coding patterns or probabilisti- 
cally handle ambiguous information that is abundant in real-life code and is widely 
used by coders. For example, when a software engineer encounters a variable named 
“counter”, without any additional context, she/he will conclude with high probability 
that this variable is a non-negative integer that enumerates some elements or 
events. In contrast, a formal program analysis method — having no additional con- 
text — will conservatively conclude that “counter” may contain any value. 

Machine learning-based program analysis (Section 22.2) aims to provide this 
human-like ability to learn to reason over ambiguous and partial information at the 
cost of foregoing the ability to provide (absolute) guarantees. Instead, through learn- 
ing common coding patterns, such as naming conventions and syntactic idioms, 
these methods can offer (probabilistic) evidence about aspects of the behavior of a 
program. This is not to say that machine learning makes traditional program analy- 
ses redundant. Instead, machine learning provides a useful weapon in the arsenal of 
program analysis methodologies. 

Graph representations of programs play a central role in program analysis and allow 
reasoning over the complex structure of programs. Section 22.3 illustrates one 
such graph representation, which we use throughout this chapter, and discusses alternatives. 
We then discuss GNNs, which have found a natural fit for machine learning-based 
program analyses, and relate them to other machine learning models (Section 22.4). 
GNNs allow us to represent, learn, and reason over programs elegantly by integrating 
the rich, deterministic relationships among program entities with the ability to 
learn over ambiguous coding patterns. In this chapter, we discuss how to approach two 
practical static program analyses using GNNs: bug detection (Section 22.5), and 
probabilistic type inference (Section 22.6). We conclude this chapter (Section 22.7) by discussing 
open challenges and promising new areas of research in the area. 


22.2 Machine Learning in Program Analysis 


Before discussing program analysis with GNNs, it is important to take a step back 
and ask where machine learning can help program analysis and why. At first glance, 
these two fields seem incompatible: static program analyses commonly seek guar- 
antees (e.g., a program never reaches some state) and dynamic program analyses 
certify some aspect of a program’s execution (e.g., specific inputs yield expected 
outputs), whereas machine learning models probabilities of events. 

At the same time, the burgeoning area of machine learning for code (Allamanis 
et al, 2018a) has shown that machine learning can be applied to source code across 
a series of software engineering tasks. The premise is that although code has a de- 
terministic, unambiguous structure, humans write code that contains patterns and 
ambiguous information (e.g. comments, variable names) that is valuable for under- 
standing its functionality. It is this phenomenon that program analysis can also take 
advantage of. 

There are two broad areas where machine learning can be used in program anal- 
ysis: learning proof heuristics, and learning static or dynamic program analyses. 
Commonly, static program analyses resort to converting the analysis task into a 
combinatorial search problem, such as a Boolean satisfiability problem (SAT), or 
another form of theorem proving. Such problems are known to often be computationally 
intractable. Machine learning-based methods, such as the work of (Irving 
et al, 2016) and (Selsam and Bjørner, 2019), have shown the promise that heuristics 
can be learned to guide combinatorial search. Discussing this exciting area of 
research is out of scope for this chapter. Instead, we focus on the static program analysis 
learning problem. 

Conceptually, a specification defines a desired aspect of a program’s functionality 
and can take many forms, from natural language descriptions to formal mathemati- 
cal constructs. Traditional static program analyses commonly resort to formulating 
program analyses through rigorous formal methods and dynamic analyses through 
observations of program executions. However, defining such program analyses is a 
tedious, manual task that can rarely scale to a wide range of properties and programs. 
Although it is imperative that formal methods are used for safety-critical applications, 
there is a wide range of applications that miss out on the opportunity to benefit 
from program analysis. Machine learning-based program analysis aims to address 
this, but sacrifices the ability to provide guarantees. Specifically, machine learning 
can help program analyses deal with the two common sources of ambiguities: latent 
specifications, and ambiguous execution contexts (e.g., due to dynamically loaded 
code). Program analysis learning commonly takes one of three forms, discussed 
next. 


Specification Tuning where an expert writes a sound program analysis which may
yield many false positives (false alarms). Raising a large number of false alarms
leads to the analogue of Aesop’s “The Boy who Cried Wolf”: too many false alarms
lead to true positives getting ignored, diminishing the utility of the analysis. To
address this, work such as that of (Raghothaman et al, 2018) and (Mangal et al,
2015) uses machine learning methods to “tune” (or post-process) a program analysis
by learning which aspects of the formal analysis can be discounted, increasing
precision at the cost of recall (soundness).


Specification Inference where a machine learning model is asked to learn to pre- 
dict a plausible specification from existing code. By making the (reasonable) as- 
sumption that most of the code in a codebase complies with some latent specifica- 
tion, machine learning models are asked to infer closed forms of those specifica- 
tions. The predicted specifications can then be input to traditional program analyses 
that check if a program satisfies them. Examples of such models are the factor graphs 
of (Kremenek et al, 2007) for detecting resource leaks, the work of (Livshits et al, 
2009) and (Chibotaru et al, 2019) for information flow analysis, the work of (Si 
et al, 2018) for generating loop invariants, and the work of (Bielik et al, 2017) for 
synthesizing rule-based static analyzers from examples. The type inference problem 
discussed in Section 22.6 is also an instance of specification inference. 

Weaker specifications — commonly used in dynamic analyses — can also be in- 
ferred. For example, Ernst et al (2007) and Hellendoorn et al (2019a) aim to predict 
invariants (assert statements) by observing the values during execution. Tufano et al 
(2020) learn to generate unit tests that describe aspects of the code’s behavior.


Black Box Analysis Learning where the machine learning model acts as a black
box that performs the program analysis and raises warnings but never explicitly for- 
mulates a concrete specification. Such forms of program analysis have great flexi- 
bility and go beyond what many traditional program analyses can do. However, they 
often sacrifice explainability and provide no guarantees. Examples of such methods 
include DeepBugs (Pradel and Sen, 2018), Hoppity (Dinella et al, 2020), and the 
variable misuse problem (Allamanis et al, 2018b) discussed in Section 22.5. 

In Sections 22.5 and 22.6, we showcase two learned program analyses using
GNNs. However, we first need to discuss how to represent programs as graphs (Sec- 
tion 22.3) and how to process these graphs with GNNs (Section 22.4). 


22.3 A Graph Representation of Programs


Many traditional program analysis methods are formulated over graph representations
of programs. Examples of such representations include syntax trees, control
flow, data flow, program dependence, and call graphs, each providing a different
view of a program. At a high level, programs can be thought of as a set of heterogeneous
entities that are related through various kinds of relations. This view directly
maps a program to a heterogeneous directed graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, with each entity
being represented as a node and each relationship of type $r$ represented as an edge
$(v_i, r, v_j) \in \mathcal{E}$. These graphs resemble knowledge bases with two important
differences: (1) nodes and edges can be deterministically extracted from source code and
other program artifacts, and (2) there is one graph per program/code snippet.
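
As a minimal sketch of this view, a program graph can be stored as typed nodes plus (source, relation, target) triples. The class name, field names, and the three-node example below are illustrative assumptions and are not taken from any particular library or from the chapter's own tooling.

```python
from dataclasses import dataclass, field


@dataclass
class ProgramGraph:
    """Heterogeneous directed graph with typed nodes and typed edges."""
    node_types: list = field(default_factory=list)   # node id -> node type
    node_labels: list = field(default_factory=list)  # node id -> text label
    edges: set = field(default_factory=set)          # {(src, relation, dst)}

    def add_node(self, node_type, label):
        """Append a node and return its integer id."""
        self.node_types.append(node_type)
        self.node_labels.append(label)
        return len(self.node_types) - 1

    def add_edge(self, src, relation, dst):
        """Record one typed edge (v_i, r, v_j)."""
        self.edges.add((src, relation, dst))


# Hypothetical fragment: two tokens of `x = ...` and the symbol for x.
graph = ProgramGraph()
tok_a = graph.add_node("token", "x")
tok_b = graph.add_node("token", "=")
sym_x = graph.add_node("symbol", "x")
graph.add_edge(tok_a, "NextToken", tok_b)
graph.add_edge(tok_a, "OccurenceOf", sym_x)
```

Storing edges as triples keeps the relation type explicit, which is what a GNN for heterogeneous graphs needs in order to treat each edge type differently.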

However, deciding which entities and relations to include in a graph represen- 
tation of a program is a form of feature engineering and task-dependent. Note that 
there is no unique or widely accepted method to convert a program into a graph 
representation; different representations offer trade-offs between expressing various 
program properties, the size of the graph representation, and the (human and com- 
putational) effort required to generate them. 

In this section we illustrate one possible program graph representation inspired 
by (Allamanis et al, 2018b), who model each source code file as a single graph. 
We discuss other graph representations at the end of this section. Figure 22.1 shows 
the graph for a hand-crafted synthetic Python code snippet curated to illustrate a 
few aspects of the graph representation. A high-level explanation of the entities 
and relations follows; for a detailed overview of the relevant concepts, we refer the 
reader to programming language literature, such as the compiler textbook of (Aho 
et al, 2006). 


Tokens A program’s source code is in its most basic form a string of characters. By
construction, programming languages can be deterministically tokenized (lexed) into
a sequence of tokens (also known as lexemes). Each token can then be represented
as a node (white boxes with gray border in Figure 22.1) of “token” type. These
nodes are connected with a NextToken edge (not shown in Figure 22.1) to form a
linear chain.


Fig. 22.1: A heterogeneous graph representation of a simple synthetic Python program
(some nodes omitted for visual clarity). Source code is represented as a heterogeneous
graph with typed nodes and edges (shown at the bottom of the figure).
Code is originally made of tokens (token nodes) which can deterministically be
parsed into a syntax tree with non-terminal nodes (vertexes). The symbols present
in the snippet (e.g. variables) can then be computed (Symbol nodes) and each reference
to a symbol is denoted by an OccurenceOf edge. Finally, dataflow edges can be
computed (MayNextUse) to indicate the possible flows of values in the program.
Note, the snippet here contains a bug in line 4 (see Section 22.5).
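
The token chain can be sketched with Python's standard tokenize module. The helper name token_chain and the two-line input are assumptions made for illustration; layout-only tokens (newlines, indents) are dropped here, which is a simplification rather than a requirement of the representation.

```python
import io
import tokenize

SOURCE = "def f(a, b):\n    return a + b\n"

# Token kinds that only encode layout; this sketch leaves them out.
LAYOUT = {tokenize.NEWLINE, tokenize.NL, tokenize.INDENT,
          tokenize.DEDENT, tokenize.ENDMARKER}


def token_chain(source):
    """Return the token strings and the NextToken edges linking them."""
    readline = io.StringIO(source).readline
    tokens = [tok.string
              for tok in tokenize.generate_tokens(readline)
              if tok.type not in LAYOUT]
    next_token = [(i, "NextToken", i + 1) for i in range(len(tokens) - 1)]
    return tokens, next_token


tokens, next_token_edges = token_chain(SOURCE)
```

Each token index doubles as the node id, so a chain of n tokens yields n - 1 NextToken edges.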


Syntax The sequence of tokens is parsed into a syntax tree. The leaves of the tree
are the tokens and all other nodes of the tree are “syntax nodes” (Figure 22.1; grey
blue rounded boxes). Using edges of Child type, all syntax nodes and tokens are connected
to form a tree structure. This structure provides contextual information about
the syntactical role of the tokens, and groups them into expressions and statements,
the core units in program analysis.
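
A rough sketch of extracting Child edges with Python's standard ast module follows. Note one deliberate difference from the description above: ast produces an abstract syntax tree whose leaves are not the source tokens, so linking tokens to the tree would need an extra step that is omitted. The function name syntax_edges is an assumption.

```python
import ast


def syntax_edges(source):
    """Return syntax node type names (pre-order) and the Child edges between them."""
    tree = ast.parse(source)
    nodes, edges = [], []

    def visit(node):
        node_id = len(nodes)
        nodes.append(type(node).__name__)      # node type, e.g. "Assign"
        for child in ast.iter_child_nodes(node):
            edges.append((node_id, "Child", visit(child)))
        return node_id

    visit(tree)
    return nodes, edges


nodes, child_edges = syntax_edges("x = y + 1")
```

Because the result is a tree, the number of Child edges is always one less than the number of nodes.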


Symbols Next, we introduce “symbol” nodes (Figure 22.1; black boxes with
dashed outline). Symbols in Python are the variables, functions, and packages that are
available at a given scope of a program. Like most compilers and interpreters, after
parsing the code, Python creates a symbol table containing all the symbols within
each file of code. For each symbol, a node is created. Then, every identifier token
(e.g., the content tokens in Figure 22.1) or expression node is connected to the symbol
node it refers to. Symbol nodes act as a central point of reference among the
uses of variables and are useful for modeling long-range relationships (e.g., how
an object is used).
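
The following sketch groups identifier occurrences by the symbol they refer to, which is the information the symbol nodes and their incoming edges encode. It assumes a single scope, so one symbol per name is enough; real code needs proper scope resolution. The snippet clip and the helper occurrences are hypothetical and are not the code of Figure 22.1.

```python
import ast
from collections import defaultdict

SOURCE = """
def clip(content, max_len):
    if len(content) > max_len:
        content = content[:max_len]
    return content
"""


def occurrences(source):
    """Map each name to the sorted (line, column) positions that refer to it."""
    uses = defaultdict(list)
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.arg):        # parameter definition
            uses[node.arg].append((node.lineno, node.col_offset))
        elif isinstance(node, ast.Name):     # variable read or write
            uses[node.id].append((node.lineno, node.col_offset))
    return {name: sorted(locs) for name, locs in uses.items()}


symbol_uses = occurrences(SOURCE)
```

Every position listed under a name would become one edge from that identifier to the shared symbol node, which gives distant uses of the same variable a two-hop connection.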


Data Flow To convey information about the program execution we add data flow 
edges to the graph (dotted curved lines in Figure 22.1) using an intraprocedural 
dataflow analysis. Although the actual data flow within the program during execution
is unknown due to the use of branching in loops and if statements, we can add
edges indicating all the valid paths that data may flow through the program. Take as 
an example the parameter min_len in Figure 22.1. If the condition in line 3 is true, 
then min_len will be accessed in line 4, but not in line 5. Conversely, if the condition 
in line 3 is false, then the program will proceed to line 5, where min_len will be 
accessed. We denote this information with a MayNextUse edge. This construction 
resembles a program dependence graph (PDG) used in compilers and conventional 
program analyses. In contrast to the edges previously discussed, MayNextUse has a 
different flavor. It does not indicate a deterministic relationship but sketches all pos- 
sible data flows during execution. Such relationships are central in program analyses 
where existential or universal properties of programs need to be computed. For example,
a program analysis may need to compute that for all ($\forall$) possible execution
paths some property is true, or that there exists ($\exists$) at least one possible execution
with some property.
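
To make the "may" semantics concrete, here is a simplified sketch that works on statements rather than tokens and ignores redefinitions. Both are assumptions made for brevity. Given a control-flow successor map and the variables each statement uses, it links every use to all uses that could come next on some path. The three-statement control flow is hypothetical.

```python
def may_next_use(successors, uses, var):
    """Return edges (s, t) where t may be the next statement using var after s."""
    edges = set()
    for start in successors:
        if var not in uses[start]:
            continue
        stack, seen = list(successors[start]), set()
        while stack:
            stmt = stack.pop()
            if stmt in seen:
                continue
            seen.add(stmt)
            if var in uses[stmt]:
                edges.add((start, stmt))        # possible next use; stop this path
            else:
                stack.extend(successors[stmt])  # keep following the control flow
    return edges


# Hypothetical control flow: 1 is an `if` test, 2 is its body, 3 follows both.
successors = {1: [2, 3], 2: [3], 3: []}
uses = {1: {"x", "n"}, 2: {"x", "n"}, 3: {"x"}}

x_edges = may_next_use(successors, uses, "x")
n_edges = may_next_use(successors, uses, "n")
```

Statement 1 gets two outgoing edges for x because either branch may execute; no single execution follows both.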

It is interesting to observe that just using the token nodes and NextToken edges 
we can (deterministically) compute all other nodes and edges. Compilers do ex- 
actly that. Then why introduce those additional nodes and edges and not let a neural 
network figure them out? Extracting such graph representations is cheap computa- 
tionally and can be performed using the compiler/interpreter of the programming 
language without substantial effort. By directly providing this information to ma- 
chine learning models — such as GNNs — we avoid “spending” model capacity for 
learning deterministic facts and introduce inductive biases that can help on program 
analysis tasks. 


Alternative Graph Representations So far we presented a simplified graph representation
inspired by (Allamanis et al, 2020). However, this is just one possible
representation among many, one that emphasizes the local aspects of code, such as
syntax and intraprocedural data flow. These aspects will be useful for the tasks discussed
in Sections 22.5 and 22.6. Other entities and relationships can be added to
the graph representation of Figure 22.1. For example, Allamanis et al (2018b) use a
GuardedBy edge type to indicate that a statement is guarded by a condition (i.e., it
is executed only when the condition is true), and Cvitkovic et al (2018) use a SubtokenOf
edge to connect tokens to special subtoken nodes indicating that the nodes
share a common subtoken (e.g., the tokens max_len and min_len in Figure 22.1 share
the len subtoken).

Representations such as the one presented here are local, i.e., they emphasize the local
structure of the code and allow detecting and using fine-grained patterns. Other local
representations, such as the one of (Cummins et al, 2020), emphasize the data and
control flow, removing the rich natural language information in identifiers and comments,
which is unnecessary for some compiler program analysis tasks. However,
such local representations yield extremely large graphs when representing multiple
files and the graphs become too large for current GNN architectures to meaningfully
process (e.g., due to very long distances among nodes). Although a single, general
graph representation that includes every imaginable entity and relationship would
seem useful, existing GNNs would struggle to process the deluge of data. Nevertheless,
alternative graph constructions that emphasize different program aspects are
found in the literature and provide different trade-offs.

One such representation is the global hypergraph representation of (Wei et al, 
2019) that emphasizes the inter- and intraprocedural type constraints among expres- 
sions in a program, ignoring information about syntactic patterns, control flow, and 
intraprocedural data flow. This allows processing whole programs (instead of single 
files; as in the representation of Figure 22.1) in a way that is suitable for predicting 
type annotations, but misses the opportunity to learn from syntactic and control-flow 
patterns. For example, it would be hard to argue for using this representation for the
variable misuse bug detection discussed in Section 22.5. 

Another kind of graph representation is the extrinsic one defined by (Abdelaziz
et al, 2020), who combine syntactic and semantic information of programs
with metadata such as documentation and content from question and answer (Q&A)
websites. Such representations often de-emphasize aspects of the code structure, focusing
on other natural language and social elements of software development. Such
a representation would be unsuitable for the program analyses of Sections 22.5 and
22.6.


22.4 Graph Neural Networks for Program Graphs 


Given the predominance of the graph representations for code, a variety of ma- 
chine learning techniques has been employed for program analyses over program 
graphs, well before GNNs got established in the machine learning community. In 
these methods, we find some of the origins and motivations for GNNs. 

One popular approach has been to project the graph into another simpler repre- 
sentation that other machine learning methods can accept as input. Such projections 
include sequences, trees, and paths. For example, Mir et al (2021) encode the sequences
of tokens around each variable usage to predict its type (as in the use case
of Section 22.6). Sequence-based models offer great simplicity and have good computational
performance but may miss the opportunity to capture complex structural
patterns such as data and control flow.

Another successful representation is the extraction of paths from trees or graphs.
For example, Alon et al (2019a) extract a sample of the paths between every two
terminal nodes in an abstract syntax tree, which resembles random walk methods
(Vishwanathan et al, 2010). Such methods can capture the syntactic information
and learn to derive some of the code’s semantic information. These paths are easy
to extract and provide useful features to learn about code. Nevertheless, they are
lossy projections of the entities and relations within a program, which a GNN can, in
principle, use in full.

Finally, factor graphs, such as conditional random fields (CRFs), work directly on
graphs. Such models commonly include carefully constructed graphs that capture 
only the relevant relationships. The most prominent example in program analysis 
includes the work of Raychev et al (2015) that captures the type constraints among 
expressions and the names of identifiers. While such models accurately represent 
entities and relationships, they commonly require manual feature engineering and 
cannot easily learn “soft” patterns beyond those explicitly modeled. 


Graph Neural Networks GNNs rapidly became a valuable tool for learned program
analyses given their flexibility to learn from rich patterns and the ease
of combining them with other neural network components. Given a program graph
representation, GNNs compute the network embeddings for each node, to be used
for downstream tasks, such as those discussed in Sections 22.5 and 22.6. First, each
entity/node $v_i$ is embedded into a vector representation $\mathbf{n}_{v_i}$. Program graphs have
rich and diverse information in their nodes, such as meaningful identifier names
(e.g. max_len). To take advantage of the information within each token and symbol
node, its string representation is subtokenized (e.g. “max”, “len”) and each initial
node representation $\mathbf{n}_{v_i}$ is computed by pooling the embeddings of the subtokens,
i.e., for a node $v_i$ and for sum pooling, the input node representation is computed as

$$\mathbf{n}_{v_i} = \sum_{s \in \mathrm{SUBTOKENIZE}(v_i)} \mathbf{t}_s$$

where $\mathbf{t}_s$ is a learned embedding for a subtoken $s$. For syntax nodes, their initial
state is the embedding of the type of the node. Then, any GNN architecture that
can process directed heterogeneous graphs¹ can be used to compute the network
embeddings, i.e.,

$$\{\mathbf{h}_{v_i}\} = \mathrm{GNN}\left(\mathcal{G}', \{\mathbf{n}_{v_i}\}\right), \tag{22.1}$$

where the GNN commonly has a fixed number of “layers” (e.g. 8), $\mathcal{G}' = (\mathcal{V}, \mathcal{E} \cup \mathcal{E}_{inv})$,
and $\mathcal{E}_{inv}$ is the set of inverse edges of $\mathcal{E}$, i.e., $\mathcal{E}_{inv} = \{(v_j, r^{-1}, v_i), \forall (v_i, r, v_j) \in \mathcal{E}\}$.
The network embeddings $\{\mathbf{h}_{v_i}\}$ are then the input to a task-specific neural network.
We discuss two tasks in the next sections.
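
The two preprocessing steps described here, sum pooling of subtoken embeddings and adding inverse edges, can be sketched in plain Python. The splitting rule, the fixed two-dimensional embeddings (standing in for learned ones), and the "^-1" suffix for inverse relations are all assumptions chosen for illustration.

```python
import re


def subtokenize(identifier):
    """Split snake_case and camelCase identifiers into lower-case subtokens."""
    return [part.lower() for part in re.findall(r"[A-Za-z][a-z]*|\d+", identifier)]


def initial_state(identifier, embeddings, dim):
    """Sum-pool subtoken embeddings; unknown subtokens contribute zeros."""
    state = [0.0] * dim
    for sub in subtokenize(identifier):
        vec = embeddings.get(sub, [0.0] * dim)
        state = [a + b for a, b in zip(state, vec)]
    return state


def with_inverse_edges(edges):
    """Return the edge set extended with one inverse edge per original edge."""
    return set(edges) | {(dst, rel + "^-1", src) for src, rel, dst in edges}


# Hypothetical "learned" subtoken embeddings, fixed here for illustration.
embeddings = {"max": [1.0, 0.0], "min": [0.0, 1.0], "len": [0.5, 0.5]}
n_max_len = initial_state("max_len", embeddings, 2)
augmented = with_inverse_edges({(0, "Child", 1)})
```

Because max_len and min_len share the len subtoken, their initial states share a component, which lets the model relate the two names before any message passing happens.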


¹ GGNNs (Li et al, 2016b) have historically been a common option, but other architectures have
shown improvements (Brockschmidt, 2020) over plain GGNNs for some tasks.


22.5 Case Study 1: Detecting Variable Misuse Bugs


We now focus on a black box analysis learning problem that utilizes the graph rep- 
resentation discussed in the previous section. Specifically, we discuss the variable 
misuse task, first introduced by (Allamanis et al, 2018b), but employ the formulation
of (Vasic et al, 2018). A variable misuse is the incorrect use of one variable instead 
of another already in the scope. Figure 22.1 contains such a bug in line 4, where 
instead of min_len, the max_len variable needs to be used to correctly truncate the 
content. To tackle this task a model needs to first localize (locate) the bug (if one 
exists) and then suggest a repair. 

Such bugs happen frequently, often due to careless copy-paste operations, and can
often be thought of as “typos”. Karampatsis and Sutton (2020) find that more than 12%
of the bugs in a large set of Java codebases are variable misuses, whereas Tarlow et al
(2020) find 6% of Java build errors in the Google engineering systems are variable
misuses. This is a lower bound, since the Java compiler can only detect variable
misuse bugs through its type checker. The author conjectures, from his personal
experience, that many more variable misuse bugs arise during code editing and
are resolved before being committed to a repository.

Note that this is a black box analysis learning task. No explicit specification 
of what the user tries to achieve exists. Instead the GNN needs to infer this from 
common coding patterns, natural language information within comments (like the 
one in line 2; Figure 22.1) and identifier names (like min, max, and len) to reason 
about the presence of a likely bug. In Figure 22.1 it is reasonable to assume that the 
developer’s intent is to truncate content to max_len when it exceeds that size (line 
4). Thus, the goal of the variable misuse analysis is to (1) localize the bug (if one 
exists) by pointing to the buggy node (the min_len token in line 4), and (2) suggest 
a repair (the max_len symbol). 

To achieve this, assume that a GNN has computed the network embeddings $\{\mathbf{h}_{v_i}\}$
for all nodes $v_i \in \mathcal{V}$ in the program graph $\mathcal{G}$ (Equation 22.1). Then, let $\mathcal{V}_{vu} \subset \mathcal{V}$ be
the set of token nodes that refer to variable usages, such as the min_len token in line
4 (Figure 22.1). First, a localization module aims to pinpoint which variable usage
(if any) is a variable misuse. This is implemented as a pointer network (Vinyals
et al, 2015) over $\mathcal{V}_{vu} \cup \{\emptyset\}$, where $\emptyset$ denotes the “no bug” event with a learned
embedding $\mathbf{h}_\emptyset$. Then, using a (learnable) projection $\mathbf{u}$ and a softmax, we can compute
the probability distribution over $\mathcal{V}_{vu}$ and the special “no bug” event,

$$p_{loc}(v_i) = \underset{v_j \in \mathcal{V}_{vu} \cup \{\emptyset\}}{\mathrm{softmax}} \left(\mathbf{u}^\top \mathbf{h}_{v_i}\right). \tag{22.2}$$

In the case of Figure 22.1, a GNN detecting the variable misuse bug in line 4 would
assign a high $p_{loc}$ to the node corresponding to the min_len token, which is the
location of the variable misuse bug. During (supervised) training the loss is simply
the cross-entropy classification loss of the probability of the ground-truth location
(Equation 22.2).


     def describe_identity_pool(self, identity_pool_id):
         identity_pool = self.identity_pools.get(identity_pool_id, None)

         if not identity_pool:
    -        raise ResourceNotFoundError(identity_pool)
    +        raise ResourceNotFoundError(identity_pool_id)


Fig. 22.2: A diff snippet of code with a real-life variable misuse error caught by a
GNN-based model in the https://github.com/spulec/moto open-source
project.
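
The localization step of Equation 22.2 can be sketched without any deep learning library. The vectors below are made-up stand-ins for GNN outputs and learned parameters, and the function names are assumptions. The sketch only shows how the "no bug" slot joins the softmax and how the cross-entropy loss is read off the result.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def localize(usage_embeddings, no_bug_embedding, u):
    """Distribution over variable usages, with the 'no bug' event in the last slot."""
    candidates = usage_embeddings + [no_bug_embedding]
    scores = [sum(ui * hi for ui, hi in zip(u, h)) for h in candidates]
    return softmax(scores)


def localization_loss(probs, target_index):
    """Cross-entropy loss for the ground-truth location (or the 'no bug' slot)."""
    return -math.log(probs[target_index])


# Hypothetical embeddings for two variable usages, plus the 'no bug' embedding.
probs = localize([[0.0, 0.0], [2.0, 0.0]], [0.0, 0.0], u=[1.0, 0.0])
loss = localization_loss(probs, target_index=1)
```

Treating "no bug" as one more candidate lets a single softmax answer both whether there is a bug and where it is.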


Repair given the location of a variable misuse bug can also be represented as a
pointer network over the nodes of the symbols that are in scope at the variable misuse
location $v_{bug}$. We define $\mathcal{V}_{s@v_{bug}}$ as the set of the symbol nodes of the alternative
candidate symbols that are in scope at $v_{bug}$, except for the symbol node of $v_{bug}$.
In the case of Figure 22.1 and the bug in line 4, $\mathcal{V}_{s@v_{bug}}$ would contain the content
and max_len symbol nodes. We can then compute the probability of repairing the
localized variable misuse bug with the symbol $s_i$ as

$$p_{rep}(s_i) = \underset{s_j \in \mathcal{V}_{s@v_{bug}}}{\mathrm{softmax}} \left(\mathbf{w}^\top \left[\mathbf{h}_{v_{bug}}, \mathbf{h}_{s_i}\right]\right),$$

i.e., the softmax of the concatenation of the node embeddings of $v_{bug}$ and $s_i$, projected
onto $\mathbf{w}$ (i.e., a linear layer). For the example of Figure 22.1, $p_{rep}(s_i)$ should
be high for the symbol node of max_len, which is the intended repair for the variable
misuse bug. Again, in supervised training, we minimize the cross-entropy loss
of the probability of the ground-truth repair.
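
A matching sketch of the repair distribution follows the formula literally: concatenate the embedding of the buggy node with each candidate symbol embedding, project with a single weight vector, and normalize. All numbers and names are illustrative assumptions.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def repair_distribution(h_bug, candidate_symbol_embeddings, w):
    """Probability of each in-scope candidate symbol being the correct repair."""
    scores = []
    for h_sym in candidate_symbol_embeddings:
        concat = h_bug + h_sym                  # list concatenation [h_bug, h_sym]
        scores.append(sum(wi * xi for wi, xi in zip(w, concat)))
    return softmax(scores)


# Hypothetical embeddings of two candidate symbols and a projection vector w.
candidates = [[1.0, 0.0], [0.0, 1.0]]
w = [0.5, -0.5, 0.0, 2.0]
p_rep = repair_distribution([0.3, 0.7], candidates, w)
```

One arithmetic consequence of a purely linear projection is that the part of the score contributed by the buggy node's embedding is identical for every candidate and cancels inside the softmax. In this sketch the buggy node therefore influences the result only through the candidate embeddings, which the GNN has already conditioned on their graph context.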


Training When a large dataset of variable misuse bugs and the relevant fixes can
be mined, the GNN-based model discussed in this section can be trained in a supervised
manner. However, such datasets are hard to collect at the scale that existing
deep learning methods require to achieve reasonable performance. Instead, work in
this area has opted to automatically insert random variable misuse bugs in code
scraped from open-source repositories, such as GitHub, and create a corpus of
randomly inserted bugs (Vasic et al, 2018; Hellendoorn et al, 2019b). However, the
random generation of buggy code needs to be carefully performed. If the randomly
introduced bugs are “too obvious”, the learned models will not be useful. For example,
random bug generators should avoid introducing a variable misuse that causes
a variable to be used before it is defined (use-before-def). Although such randomly
generated corpora are not entirely representative of real-life bugs, they have been
used to train models that can catch real-life bugs.
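
The bug-insertion idea can be sketched at the token level. As a crude stand-in for a real definition analysis, the sketch treats the first appearance of a variable as its definition and only substitutes variables that have already appeared, which avoids the use-before-def case mentioned above. The function name and this first-appearance rule are assumptions and do not describe the procedure of the cited work.

```python
import random


def inject_variable_misuse(tokens, variables, rng):
    """Swap one variable usage for a different variable that appeared earlier.

    Returns (buggy_tokens, position, original_token), or None if no swap is possible.
    """
    names = sorted(variables)                   # fixed order for reproducibility
    first_seen = {}
    for i, tok in enumerate(tokens):
        if tok in variables:
            first_seen.setdefault(tok, i)
    candidates = [
        (i, other)
        for i, tok in enumerate(tokens)
        if tok in variables and i != first_seen[tok]    # skip the defining occurrence
        for other in names
        if other != tok and first_seen.get(other, len(tokens)) < i
    ]
    if not candidates:
        return None
    position, replacement = rng.choice(candidates)
    buggy = list(tokens)
    buggy[position] = replacement
    return buggy, position, tokens[position]


tokens = ["def", "f", "(", "a", ",", "b", ")", ":", "return", "a", "+", "b"]
result = inject_variable_misuse(tokens, {"a", "b"}, random.Random(0))
```

The returned position and original token are exactly the supervision signals needed for the localization and repair losses.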

When evaluated over randomly generated corpora, variable misuse models like those
presented in this section achieve relatively high accuracy, with accuracies
of up to 75% (Hellendoorn et al, 2019b). However, in the author’s experience,
for real-life bugs, while some variable misuse bugs are recalled, precision
tends to be low, making the models impractical for deployment. Improving upon this is
an important open research problem. Nevertheless, actual bugs have been caught in
practice. Figure 22.2 shows such an example caught by a GNN-based variable misuse
detector. Here, the developer incorrectly passed identity_pool instead of identity_pool_id
as the exception argument when identity_pool was None (no pool with
the requested id could be found). The GNN-based black-box analysis seems to have
learned to “understand” that it is unlikely that the developer’s intention is to pass
None to the ResourceNotFoundError constructor and instead suggests that it should
be replaced by identity_pool_id. This is without ever formulating a formal specification
or creating a symbolic program analysis rule.


22.6 Case Study 2: Predicting Types in Dynamically Typed Languages


Types are one of the most successful innovations in programming languages. Specif- 
ically, type annotations are explicit specifications over the valid values a variable can 
take. When a program type checks, we get a formal guarantee that the values of vari- 
ables will only take the values of the annotated type. For example, if a variable has 
an int annotation, it must contain integers but not strings, floats, etc. Furthermore, 
types can help coders understand code more easily and software tools such as auto- 
completion and code navigation to be more precise. However, many programming 
languages either have to decide to forgo the guarantees provided by types or require 
their users to explicitly provide type annotations. 

To overcome these limitations, specification inference methods can be used to 
predict plausible type annotations and bring back some of the advantages of typed 
code. This is especially useful in code with partial contexts (e.g., a standalone snip- 
pet of code in a webpage) or optionally typed languages. This section looks into 
Python, which provides an optional mechanism for defining type annotations. For 
example, content in Figure 22.1 can be annotated as content: str in line 1 to indi- 
cate that the developer expects that it will only contain string values. These annota- 
tions can then be used by type checkers, such as mypy (mypy Contributors, 2021) 
and other developer tools and code editors. This is the probabilistic type inference 
problem, first proposed by (Raychev et al, 2015). Here we use the GRAPH2CLASS 
GNN-based formulation of (Allamanis et al, 2020) treating this as a classification 
task over the symbols of the program similar to (Hellendoorn et al, 2018). Pandi 
et al (2020) offer an alternative formulation of the problem. 

For type checking methods to operate, explicit type annotations need to be provided
by a user. When those are not present, type checking may not be able to
function and provide any guarantees about the program. However, this misses the
opportunity to probabilistically reason over the types of the program from other
sources of information, such as variable names and comments. Concretely, in the
example of Figure 22.1, it would be reasonable to assume that min_len and max_len
have an integer type given their names and usage. We can then use this “educated
guess” to type check the program and recover some guarantees about the program
execution.

Such models can find multiple applications. For example, they can be used in 
recommendation systems that help developers annotate a code base. They may help 
developers find incorrect type annotations or allow editors to provide assistive fea- 
tures — such as autocomplete — based on the predicted types. Or they may offer 
“fuzzy” type checking of a program (Pandi et al, 2020). 

In its simplest form, predicting types is a node classification task over the subset
of symbol nodes. Let $\mathcal{V}_s$ be the set of nodes of “symbol” type in the heterogeneous
graph of a program. Let also $Z$ be a fixed vocabulary of type annotations, along with
a special Any type². We can then use the node embeddings of every node $v \in \mathcal{V}_s$ to
predict the possible type of each symbol,

$$p(s_j : \tau) = \underset{\tau' \in Z}{\mathrm{softmax}} \left(\mathbf{E}_\tau^\top \mathbf{h}_{v_{s_j}} + b_\tau\right),$$

i.e., the inner product of each symbol node embedding with a learnable type embedding
$\mathbf{E}_\tau$ for each type $\tau \in Z$ plus a learnable bias $b_\tau$. Training can then be performed
by minimizing some classification loss, such as the cross entropy loss, over a corpus
of (partially) annotated code.
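
The classification head amounts to a few lines of arithmetic. In the sketch below the symbol embedding, the type embeddings, and the biases are invented numbers standing in for learned parameters, and the three-type vocabulary is an assumption.

```python
import math


def predict_type(h_symbol, type_embeddings, biases):
    """Return {type name: probability} from inner products plus per-type biases."""
    names = list(type_embeddings)
    scores = [
        sum(e * h for e, h in zip(type_embeddings[name], h_symbol)) + biases[name]
        for name in names
    ]
    peak = max(scores)                          # subtract the max for stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return {name: e / total for name, e in zip(names, exps)}


# Hypothetical vocabulary with the special Any type as a fallback class.
type_embeddings = {"int": [1.0, 0.0], "str": [0.0, 1.0], "Any": [0.0, 0.0]}
biases = {"int": 0.0, "str": 0.0, "Any": 0.0}
type_probs = predict_type([2.0, 0.5], type_embeddings, biases)
best_type = max(type_probs, key=type_probs.get)
```

The cross-entropy loss for an annotated symbol is then the negative logarithm of the probability assigned to its ground-truth type.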


Type Checking The type prediction problem is a specification inference problem 
(Section 22.2) and the predicted type annotations can be passed to a standard type 
checking tool which can verify that the predictions are consistent with the source 
code’s structure (Allamanis et al, 2020) or search for the most likely prediction 
that is consistent with the program’s structure (Pradel et al, 2020). This approach
reduces false positives, but does not eliminate them. A trivial example is
an identity function def foo(x): return x. A machine learning model may incorrectly 
deduce that x is a str and that foo returns a str. Although the type checker will 
consider this prediction type-correct it is hard to justify as correct in practice. 


Training The type prediction model discussed in this section can be trained in a
supervised fashion. By scraping large corpora of code, such as open-source code
found on GitHub³, we can collect thousands of type-annotated symbols. By stripping
those type annotations from the original code and using them as a ground truth,
a training and validation set can be generated.
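
Collecting ground-truth labels from annotated code can be sketched with the standard ast module. The sketch only extracts (symbol, annotation) pairs for parameters and annotated assignments; removing the annotations from the model input is the complementary step and is omitted. The snippet and function name are hypothetical, and ast.unparse requires Python 3.9 or later.

```python
import ast

SOURCE = (
    "def clip(content: str, max_len: int = 80):\n"
    "    count: int = 0\n"
    "    return content[:max_len]\n"
)


def annotated_symbols(source):
    """Return sorted (symbol name, annotation text) pairs found in the source."""
    pairs = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.arg) and node.annotation is not None:
            pairs.append((node.arg, ast.unparse(node.annotation)))
        elif isinstance(node, ast.AnnAssign) and isinstance(node.target, ast.Name):
            pairs.append((node.target.id, ast.unparse(node.annotation)))
    return sorted(pairs)


labels = annotated_symbols(SOURCE)
```

Each pair becomes one training example: the symbol node in the graph of the un-annotated code is the input and the annotation text is the target class.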

Such systems have been shown to achieve a reasonably high accuracy (Allamanis et al,
2020) but with some limitations: type annotations are highly structured and sparse.
For example, Dict[Tuple[int, str], List[bool]] is a valid type annotation that may
appear infrequently in code. New user-defined types (classes) will also appear at test
time. Thus, treating type annotations as distinct classes of a classification problem
is prone to severe class imbalance issues and fails to capture information about the
structure within types. Adding new types to the model can be solved by employing
meta-learning techniques such as those used in Typilus (Allamanis et al, 2020; Mir
et al, 2021), but exploiting the internal structure of types and the rich type hierarchy
is still an open research problem.


     def __init__(
         self,
    -    embedding_dim: float = 768,
    -    ffn_embedding_dim: float = 3072,
    -    num_attention_heads: float = 8,
    +    embedding_dim: int = 768,
    +    ffn_embedding_dim: int = 3072,
    +    num_attention_heads: int = 8,
         dropout: float = 0.1,
         attention_dropout: float = 0.1,
         ...


Fig. 22.3: A diff snippet from the incorrect type annotation caught by Typilus (Allamanis
et al, 2020) in the open-source fairseq library.


² The type Any represents the top of the type lattice and is somewhat analogous to the special
UNKNOWN token used in NLP.

³ Automatically scraped code corpora are known to suffer from a large number of duplicates (Allamanis,
2019). When collecting such corpora special care is needed to remove those duplicates to
ensure that the test set is not contaminated with training examples.

Applications of type prediction models include suggesting new type annotations
for previously un-annotated code, but the models can also be used for other downstream tasks
that can exploit a probabilistic estimate of the type of some symbol.
Additionally, such models can help find incorrect type annotations provided by the 
users. Figure 22.3 shows such an example from Typilus (Allamanis et al, 2020). 
Here the neural model “understands” from the parameter names and the usage of the 
parameters (not shown) that the variables cannot contain floats but instead should 
contain integers. 


22.7 Future Directions 


GNNs for program analysis is an exciting interdisciplinary field of research combin- 
ing ideas of symbolic AI, programming language research, and deep learning with 
many real-life applications. The overarching goal is to build analyses that can help 
software engineers build and maintain the software that permeates every aspect of 
our lives. Still there are many open challenges that need to be addressed to deliver 
upon this promise. 

From a program analysis and programming language perspective a lot of work is 
needed to bridge the domain expertise of that community to machine learning. What 
kind of learned program analysis can be useful to coders? How can existing program 
analyses be improved using learned components? What are the inductive biases that 
machine learning models need to incorporate to better represent program-related 
concepts? How should learned program analyses be evaluated amidst the lack of 
large annotated corpora? Until recently, program analysis research has limited itself
to primarily using the formal structure of the program, ignoring ambiguous information
in identifiers and code comments. Researching analyses that can better leverage
this information may open new and fruitful directions to help coders across many
application domains.

Crucially, how to integrate formal aspects of program analyses
into the learning process is still an open question. Most specification inference work
(e.g. Section 22.6) treats the formal analyses as a separate pre- or post-processing
step. Integrating the two viewpoints more tightly will create better, more
robust tools. For example, researching better ways to incorporate (symbolic) con- 
straints, search, and optimization concepts within neural networks and GNNs will 
allow for better learned program analyses that can learn to better capture program 
properties. 

From a software engineering perspective, additional research is needed on the user
experience (UX) of the program analysis results presented to users. Most existing
machine learning models do not have performance characteristics that allow them to
work autonomously. Instead, they make probabilistic suggestions and present them to
users. Creating or finding affordances in the developer environment that surface
probabilistic observations and communicate the probabilistic nature of machine
learning model predictions will significantly accelerate the use of learned program
analyses.

Within the research area of GNNs there are many open research questions. GNNs 
have shown the ability to learn to replicate some of the algorithms used in common 
program analysis techniques (Veličković et al, 2019) but with strong supervision. 
How can complex algorithms be learned with GNNs using just weak supervision? 
Additionally, existing techniques often lack the representational capabilities of
formal methods. Combinatorial concepts found in formal methods, such as sets and
lattices, lack direct analogues in deep learning. Researching richer combinatorial
(and possibly non-parametric) representations will provide valuable tools for
learning program analyses.

Finally, common themes in deep learning also arise within this domain: 


• The explainability of the decisions and warnings raised by learned program
analyses is important to coders who need to understand them and either mark
them as false positives or address them appropriately. This is especially
important for black-box analyses.

• Traditional program analyses offer explicit guarantees about a program’s
behavior even within adversarial settings. Machine learning-based program analyses
relax many of those guarantees in order to reduce false positives or to provide
some value beyond that offered by formal methods (e.g. use of ambiguous
information). However, this makes these analyses vulnerable to adversarial attacks
(Yefet et al, 2020). Recovering some form of adversarial robustness is still
desirable for learned program analyses and remains an open research problem.

• Data efficiency is also an important problem. Most existing GNN-based program
analysis methods either make use of relatively large datasets of annotated code
(Section 22.6) or use unsupervised/self-supervised proxy objectives (Section 22.5).
However, many of the desired program analyses do not fit these frameworks and
would require at least some form of weak supervision. Pre-training on graphs is
one promising direction that could address this problem, but it has so far focused
on homogeneous graphs, such as social/citation networks and molecules. Techniques
developed for homogeneous graphs, such as the pre-training objectives used, do not
transfer well to heterogeneous graphs like those used in program analysis.

• All machine learning models are bound to generate false positive suggestions.
However, when models provide well-calibrated confidence estimates, suggestions can
be accurately filtered to reduce false positives and their confidence can be better
communicated to users. Researching neural methods that can make accurate and
calibrated confidence estimates will allow for greater impact of learned program
analyses.


Acknowledgements The author would like to thank Earl T. Barr for useful discussions and feed- 
back on drafts of this chapter. 


Editor’s Notes: Program analysis is one of the important downstream tasks
of graph generation (Chapter 11). The main challenge of program analysis lies
in graph representation learning (Chapter 2), which integrates the relationships
and entities of the program. On the basis of these graph representations,
heterogeneous GNNs (Chapter 16) and other variants can be used to learn the
embedding of each node for task-specific neural networks. This approach has
achieved state-of-the-art performance in bug detection and probabilistic type
inference. There are also many emerging problems in program analysis, e.g.
explainability (Chapter 7) of decisions and warnings, and adversarial
robustness (Chapter 8).


Graph Neural Networks in Software Mining


Collin McMillan 


Abstract Software Mining encompasses a broad range of tasks involving software, 
such as finding the location of a bug in the source code of a program, generating nat- 
ural language descriptions of software behavior, and detecting when two programs 
do basically the same thing. Software tends to have an extremely well-defined struc- 
ture, due to the linguistic confines of source code and the need for programmers to 
maintain readability and compatibility when working on large teams. A tradition 
of graph-based representations of software has therefore proliferated. Meanwhile, 
advances in software repository maintenance have recently helped create very large 
datasets of source code. The result is fertile ground for Graph Neural Network rep- 
resentations of software to facilitate a plethora of software mining tasks. This chap- 
ter will provide a brief history of these representations, describe typical software 
mining tasks that benefit from GNNs, demonstrate one of these tasks in detail, and 
explain the benefits that GNNs can provide. Caveats and recommendations will also 
be discussed. 


23.1 Introduction 


Software Mining is broadly defined as any task that seeks to solve a software en- 
gineering problem by analyzing the myriad artifacts in projects and their connec- 
tions (Hassan and Xie, 2010; Kagdi et al, 2007; Zimmermann et al, 2005). Consider 
the task of writing documentation. A human performing this task may gain compre- 
hension of the software by reading the source code and understanding how different 
parts of the code interact. Then he or she may write documentation explaining the 
behavior of the system based on that comprehension. Likewise, if a machine is to 
automate writing that documentation, the machine must also analyze the software 
in order to comprehend it. This analysis is often called “Software Mining.” 

While human comprehension of software is a cognitive process that occurs naturally
as engineers read and interact with that software (Letovsky, 1987; Maalej et al,
2014), machine comprehension must be formally defined and quantifiable. Typically
this boils down to a vectorized representation of each software artifact. For
example, each identifier name in a function may be assigned a vector (e.g., of
length 100) denoting its position in a word embedding space. Then the function may
be represented as the average of those vectors for the identifier names it
contains. Or it may be the output of a recurrent neural network given those
identifier name vectors, or perhaps only the names that occur in particular
locations. The point is that machine comprehension of software is often
quantifiable as a vectorized representation of the artifacts composing that
software.
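The averaging idea can be illustrated with a minimal sketch. The embedding table below is made up for illustration (three dimensions rather than 100, with hypothetical values), and the helper name function_vector is an assumption, not part of any cited approach.

```python
# Toy embedding table: the words and 3-dimensional vectors are invented for illustration.
EMBEDDINGS = {
    "send":  [1.0, 0.0, 2.0],
    "guess": [3.0, 2.0, 0.0],
    "os":    [2.0, 4.0, 1.0],
}

def function_vector(identifier_names, embeddings):
    """Average the vectors of the identifier names that have an embedding."""
    vectors = [embeddings[name] for name in identifier_names if name in embeddings]
    if not vectors:
        raise ValueError("no known identifier names")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

vec = function_vector(["send", "guess", "os"], EMBEDDINGS)
print(vec)  # [2.0, 2.0, 1.0]
```

Averaging discards the order and position of the names, which is one reason a recurrent or graph-based encoder may be preferred.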

Evidence is accumulating that Graph Neural Networks are an effective means to 
obtain these vectorized representations and thus improve machine comprehension of 
software. There is a long tradition in the Software Engineering research literature of 
treating software as a graph. Control flow graphs, call graphs, abstract syntax trees, 
execution path graphs, and many others are frequently the output of both static and 
dynamic analysis. Meanwhile, advances in software repository management have 
enabled the creation of datasets covering billions of lines of code. The result is 
fertile ground for GNNs. 

This chapter covers the history and state-of-the-art in representing software as 
a graph for GNNs, followed by a high-level discussion of current approaches, a 
detailed look at a specific approach, and caveats for future researchers. 


23.2 Modeling Software as a Graph 


Software is a high-value target for GNNs partly because software tends to be very 
highly structured as a graph or set of graphs. Different software mining tasks may 
take advantage of different graph structures from software. Graph representations 
of software go far beyond any specific software mining task. Graph representations 
are baked into the way compilers convert source code into machine code (e.g., parse 
trees). They are used during linking and dependency resolution (e.g., program
dependence graphs). And they have long been the basis for many visualization and
support tools to help programmers understand large software projects (Gema et al,
2020; Ottenstein and Ottenstein, 1984; Silva, 2012).

When considering how to make use of these different graph structures in software,
the basic questions one must ask are: “what are the nodes?” and “what are the
edges?” These questions take two forms in software engineering research: a macro-
and a micro-level representation. The macro-level representation tends to concern
connections among large software artifacts, such as a graph in which every source
code file is a node and every dependency among the files is an edge. The
micro-level representation, in contrast, tends to include small details, such as a
graph in which every token in a function is a node and every edge is a syntactic
link between the nodes, such as those often extracted from an Abstract Syntax Tree.

This section compares and contrasts these representations as they relate to using
GNNs for Software Mining tasks.


23.2.1 Macro versus Micro Representations 


Graph structures in software may be broadly classified as either macro- or micro- 
level. In theory, the distinction is superfluous because a micro-level representation 
may be scaled up to arbitrary size. For example, an entire large program may be 
represented as one large abstract syntax tree. But in practice, time and space con- 
straints necessitate a separation of macro- and micro-level representations. In a re- 
cent collection of Java programs (LeClair and McMillan, 2019), the average number 
of nodes in the AST of a function is over 120, with at least one edge per node. The 
average number of functions per program is over 1800, and there are over 28,000 
programs in the dataset. The reality is that a micro-level representation of an en- 
tire program is often not feasible, so a macro-level representation is introduced to 
capture the “big picture.” 
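A rough back-of-the-envelope calculation shows the scale involved. The sketch below simply multiplies the lower bounds quoted above, so the results are underestimates; the variable names are illustrative.

```python
# Lower bounds quoted in the text (LeClair and McMillan, 2019).
NODES_PER_FUNCTION = 120
FUNCTIONS_PER_PROGRAM = 1800
PROGRAMS = 28_000

nodes_per_program = NODES_PER_FUNCTION * FUNCTIONS_PER_PROGRAM
nodes_in_dataset = nodes_per_program * PROGRAMS

print(f"{nodes_per_program:,}")  # 216,000
print(f"{nodes_in_dataset:,}")   # 6,048,000,000
```

A single whole-program AST would therefore average more than two hundred thousand nodes, and the dataset as a whole more than six billion, before any additional edge types are considered.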


23.2.1.1 Macro-level Representations 


A macro-level graph representation of software captures the high-level structure 
and intent behind a program while avoiding a deep dive into details required to 
implement that intent. Inspiration for macro-level representations is often drawn 
from software design documents, such as those formally defined via UML (Braude 
and Bernstein, 2016; Horton, 1992). An example is a class diagram for an object- 
oriented program. Each class is a node in the graph. Edges in the graph may var- 
iously be dependency, inheritance, realization, composition, among others. Nodes 
may also have attributes that refer to the member variables and methods of a class. 

In practice, selecting a macro-level representation for a software mining task
using GNNs tends to be severely constrained by what can actually be obtained from
the dataset. Often this constraint precludes the use of behavior-based graphs such
as use case diagrams, because proper use case diagrams are rare, and those that are
available are usually not in a consistent format: engineers may follow different
conventions or provide these diagrams only informally. Software repositories tend
to be replete with source code but lack documentation, especially design
documentation (Kalliamvakou et al, 2014).

Therefore, by far the most popular macro-level graph representations tend to be
ones that can be extracted directly from source code. A decision often arises
related to the degree of granularity, which usually is a choice between
packages/directories, classes/files, or methods/functions. The class diagram is
relatively easy to extract: locate every class in a software project, then analyze
each class to find its dependencies, inheritances, etc. Package diagrams are
similar, having the advantage of quickly providing a very high level view of a
program (even large projects may only have a few dozen packages). But a very
popular alternative is a function/method call graph, in which each function in a
program is a node and each call relationship from one function to another is a
directed edge between two nodes. Call graphs are popular within Software
Engineering literature because they are relatively easy to extract while giving
enough detail for a strong macro-level view of a program without overwhelming data
sizes (recall a typical program has around 1800 functions (LeClair and McMillan,
2019)).
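To make the call graph concrete, the following is a minimal sketch that extracts one from a small Python snippet using the standard ast module. The chapter's datasets are Java, so Python is used here only for convenience; the snippet in SOURCE and the helper build_call_graph are illustrative assumptions, and only direct calls by simple name are resolved.

```python
import ast

SOURCE = '''
def read_input():
    return "42"

def parse(text):
    return int(text)

def main():
    value = parse(read_input())
    print(value)
'''

def build_call_graph(source):
    """Map each top-level function name to the set of top-level functions it calls."""
    tree = ast.parse(source)
    functions = {n.name: n for n in tree.body if isinstance(n, ast.FunctionDef)}
    graph = {name: set() for name in functions}
    for name, node in functions.items():
        for child in ast.walk(node):
            # Only direct calls by simple name are resolved, e.g. parse(...).
            if isinstance(child, ast.Call) and isinstance(child.func, ast.Name):
                if child.func.id in functions:
                    graph[name].add(child.func.id)
    return graph

call_graph = build_call_graph(SOURCE)
for caller in sorted(call_graph):
    print(caller, "->", sorted(call_graph[caller]))
```

Here main has directed edges to parse and read_input, while the other two functions have no outgoing edges. Method calls, dynamic dispatch, and calls across files would need a more careful analysis than this sketch attempts.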


23.2.1.2 Micro-level Representations 


A micro-level representation describes a portion of the software in great detail. 
Micro-level representations have been the focus of a majority of research using 
GNNs for software mining. Allamanis et al (2018b) describe one approach, point- 
ing out that the “backbone of a program graph is the program’s abstract syntax tree.” 
However, as mentioned above, it is often not feasible to build a model relying on the 
entire AST of an entire program. Instead, a typical practice is to generate the AST 
for small portions of code, such as individual functions. Each function is treated as 
a graph, independent of all other functions. 
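The practice of treating each function as its own graph can be sketched as follows, again using Python's standard ast module for convenience rather than a Java parser. Node labels are AST node type names and the only edge type is parent-child; the helper function_graphs and the snippet in SOURCE are illustrative assumptions.

```python
import ast

SOURCE = '''
def add(a, b):
    return a + b

def negate(x):
    return -x
'''

def function_graphs(source):
    """Return {function name: (node labels, parent-child edges)} for top-level functions."""
    graphs = {}
    for func in ast.parse(source).body:
        if not isinstance(func, ast.FunctionDef):
            continue
        nodes, edges = [], []

        def visit(node, parent_index):
            index = len(nodes)
            nodes.append(type(node).__name__)
            if parent_index is not None:
                edges.append((parent_index, index))
            for child in ast.iter_child_nodes(node):
                visit(child, index)

        visit(func, None)
        graphs[func.name] = (nodes, edges)
    return graphs

for name, (nodes, edges) in function_graphs(SOURCE).items():
    print(name, len(nodes), "nodes,", len(edges), "edges")
```

Each function yields a separate tree, so the number of edges is always one less than the number of nodes, and no edge crosses from one function's graph into another's.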

The benefit of treating each function as a separate graph is that a GNN model 
can be trained on each independently. A prediction model of nearly any kind will 
require independent, self-contained examples. There will be some context about 
which an output prediction is generated (or against which a sample prediction is 
used for training). By treating each function as an independent graph, a GNN can 
be trained using each function as the context. This is a tidy solution in software 
mining for two reasons. First, many tasks in software mining involve predictions 
about specific functions, such as whether that function is likely to contain a fault 
(see the next section). Second, graphs of functions derived from the AST exhibit a
community structure. In a typical function, there are many connections among nodes
inside the function, but relatively few connections from nodes inside the function
to nodes outside the function: the variables, conditionals, loops, etc., in the
code of a function interact closely with each other, while much less frequently
referring to something outside the function, such as the use of a global variable
or a call.

One may concoct any number of micro-level representations of software, based 
on different tokens in the source code and relationships of those tokens. For ex- 
ample, control flow relationships have occasionally been highlighted as often more 
valuable for comprehension than data dependencies (Dearman et al, 2005; Ko et al, 
2006). At other times, method invocations (McMillan et al, 2013; Sillito et al, 2008)
or signatures (Roehm et al, 2012) are proposed as providing superior information 
for different software mining tasks. Yet the pattern is that a micro-level representa- 
tion is generated for many small portions of a software system, and these portions 
are treated as independent of each other. A GNN can take advantage of these micro- 
level representations by learning from each one as a different sample. 


23.2.2 Combining the Macro- and Micro-level


Macro- and micro-level representations may be combined. One strategy would be to
compute both macro- and micro-level representations independently, then concatenate
them into one large context matrix. Such a model may be referred to as a “dual
encoder” (Chidambaram et al, 2019; Yang et al, 2019h) or as “cascading” (Wang et al,
2017h) in that it learns two representations of the same object but at different
levels of granularity. An alternative would be to use the output of the micro-level
representation to seed the macro-level representation, for example, by learning a
representation of each function using the AST and then using it as the initial value
for the nodes in a function call graph.
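The seeding strategy can be illustrated with a small sketch. The per-function vectors below stand in for the output of a hypothetical micro-level encoder, and the single round of mean aggregation stands in for a learned GNN layer over the call graph; all names and values are assumptions for illustration.

```python
# Hypothetical micro-level output: one vector per function (e.g., from an AST encoder).
function_vectors = {
    "main":       [1.0, 0.0],
    "parse":      [0.0, 1.0],
    "read_input": [2.0, 2.0],
}

# Macro-level structure: caller -> callees in a function call graph.
call_graph = {
    "main": ["parse", "read_input"],
    "parse": [],
    "read_input": [],
}

def propagate(vectors, graph):
    """One round of mean aggregation over each node and its callees (no learned weights)."""
    updated = {}
    for node, vector in vectors.items():
        group = [vector] + [vectors[callee] for callee in graph[node]]
        updated[node] = [sum(column) / len(group) for column in zip(*group)]
    return updated

macro_vectors = propagate(function_vectors, call_graph)
print(macro_vectors["main"])  # [1.0, 1.0]
```

After propagation, the vector for main mixes in information from the functions it calls, while the leaf functions keep their micro-level values.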


23.3 Relevant Software Mining Tasks 


Graph neural networks are becoming a staple of research in software mining tasks. 
The history of deep learning for software mining tasks is chronicled in several sur- 
veys (Allamanis et al, 2018a; Lin et al, 2020b; Semasaba et al, 2020; Song et al, 
2019b). Allamanis et al (2018a) cast a particularly wide net and broadly classify 
software mining tasks that rely on neural networks as either “code generational”
or “code representational.” This classification is based on a big picture view of the
models used for these tasks. In a code generational task, the output of the model is 
source code. Tasks in this category include automatic program repair (Chen et al, 
2019e; Dinella et al, 2020; Wang et al, 2018d; Vasic et al, 2018; Yasunaga and 
Liang, 2020), code completion (Li et al, 2018a; Raychev et al, 2014), and compiler 
optimization (Brauckmann et al, 2020). These models tend to be trained with large 
volumes of code vetted somehow to ensure quality, with the aim of learning norms 
in code that lead to that quality. Then, during inference, the goal is to bring arbi- 
trary code into closer conformance with those norms. For example, a model may be 
presented with code containing a bug, and that bug may be repaired by changing 
the code to be more like the model’s predictions (which, it is hoped, represent the 
norms learned in training). 

In contrast to code generational tasks are code representational tasks. These tasks 
use source code primarily as the input to a neural model during training but have a 
wide variety of outputs. Tasks in this category include code clone detection (Ain 
et al, 2019; Li et al, 2017c; White et al, 2016), code search (Chen and Zhou, 2018; 
Sachdev et al, 2018; Zhang et al, 2019f), type prediction (Pradel et al, 2020), and 
code summarization (Song et al, 2019b). In models designed to solve these tasks, the 
goal is usually to create a vectorized representation of code, which is then used for 
a specific task that may only be tangentially related to the code itself. For instance, 
for source code search, a neural model may be used to project the source code in 
a large repository into a vector space. Then a different model is used to project a 
natural language query into the same vector space. The code nearest to the query 
in the vector space is considered as the search result for that query. Code clone 
detection is similar: code is projected into a vector space, and very nearby code may
be considered a clone in that space.
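The nearest-neighbor step of code search can be sketched in a few lines. The vectors below are invented stand-ins for the outputs of a code encoder and a query encoder, and cosine similarity is assumed as the distance measure; the cited approaches may use other measures.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Hypothetical outputs of a code encoder: function name -> vector.
code_vectors = {
    "sendGuess":   [0.9, 0.1, 0.0],
    "parseConfig": [0.0, 0.8, 0.6],
    "closeSocket": [0.6, 0.0, 0.8],
}

def search(query_vector, vectors, top_k=1):
    """Return the names of the top_k code vectors closest to the query."""
    ranked = sorted(vectors, key=lambda name: cosine(query_vector, vectors[name]), reverse=True)
    return ranked[:top_k]

query = [1.0, 0.0, 0.1]  # hypothetical output of a query encoder
print(search(query, code_vectors, top_k=2))  # ['sendGuess', 'closeSocket']
```

Clone detection could reuse the same similarity function, flagging pairs of code vectors whose similarity exceeds a chosen threshold.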

The use of graph neural networks is ballooning in both categories of software 
mining tasks. In code generational tasks, the focus tends to be on modifications to a 
program graph such as an AST that bring that graph into closer conformity with the 
model’s expectations. While some approaches focus on code as a sequence (Chen 
et al, 2019e), the recent trend has been to recommend graph transformations or 
highlight non-conforming areas of the graph (Dinella et al, 2020; Yasunaga and 
Liang, 2020). This is useful in code because a recommendation may relate to code 
elements that are quite far away from each other, such as the declaration of a vari- 
able and a use of that variable. In contrast, in code representational tasks, the focus 
tends to be on creating ever more complex graph representations of code and then 
using GNN architectures to exploit that complexity. For example, the first GNN-based
approaches tended to use only the AST (LeClair et al, 2020), while newer
approaches use attention-based GNNs to emphasize the most important edges out
of a multitude that can be extracted from code (Zügner et al, 2021). Despite
differences in code generational and representational tasks, the trend in both
categories has strongly favored GNNs.

Consider the task of code summarization, which exemplifies the trend towards 
GNNs. Code summarization is the task of writing natural language descriptions of 
source code. Typically these descriptions are used in documentation for that source 
code, e.g., JavaDocs. The evolution of this research area is shown in Table 23.1.
The term “code summarization” was coined around 2010, and several years of active
research followed using templated and IR-based solutions. Then around 2017,
solutions based on neural networks proliferated. At first, these were essentially
seq2seq models in which the encoder sequence is the code and the decoder sequence
is the description. Starting around 2018, the state-of-the-art moved to linearized
AST representations. Graph neural networks were proposed around this time as a
better solution (Allamanis et al, 2018b), but it would be another year or more
before GNN-based approaches appeared in the literature. GNNs are now poised to
underpin the state-of-the-art. In the next section, we dive into the details of a
GNN-based solution, showing why it works and areas of future growth.


23.4 Example Software Mining Task: Source Code Summarization


This section describes source code summarization as an example software mining 
task that benefits from GNNs. Source code summarization, as mentioned above, is 
the task of writing natural language descriptions of source code. The input to a code 
summarization model includes at least the source code being described, though may 
also include other details about the software project from which the code originates. 
The output is the natural language description. This task is considered “code
representational” because it primarily relies on a learned representation of code
in order to make predictions about the description.


23.4.1 Primer GNN-based Code Summarization 


As a primer towards GNN-based code summarization, consider a technique pre- 
sented by LeClair et al (2020). This model is intended to be a straightforward appli- 
cation of convolutional GNNs in the vein of graph2seq (Xu et al, 2018c). 


IR M T A S G
Haiduc et al (2010) x 
Sridhara et al (2011) x x 
Rastkar et al (2011) 
De Lucia et al (2012) 
Panichella et al (2012) 
Moreno et al (2013) 
Rastkar and Murphy (2013) 
McBurney and McMillan (2014) x x 
Rodeghero et al (2014) x 
Rastkar et al (2014) x 
Cortés-Coy et al (2014) x 
Moreno et al (2014) x 
Oda et al (2015) x 
Abid et al (2015) x x 
Iyer et al (2016) x 
McBurney et al (2016) x x 
Zhang et al (2016a) x x 
Rodeghero et al (2017) x 
Fowkes et al (2017) x 
Badihi and Heydarnoori (2017) x x 
Loyola et al (2017) 
Lu et al (2017b) 
Jiang et al (2017) 
Hu et al (2018c) 
Hu et al (2018b) 
Wan et al (2018) 
Liang and Zhu (2018) 
Alon et al (2019a,b) 
Gao et al (2019b) 
LeClair et al (2019) 
Nie et al (2019) 
Haque et al (2020) 
Haldar et al (2020) 
LeClair et al (2020) 
Ahmad et al (2020) 
Zügner et al (2021)
Liu et al (2021) 




Table 23.1: Overview of papers on the topic of source code summarization, from the paper to
coin the term “code summarization” in 2010 to the following ten years. Note the evolution from
IR/template-based solutions to neural models and now to GNN models. Column IR indicates if the
approach is based on Information Retrieval. M indicates manual features/heuristics. T indicates
templated natural language. A indicates Artificial Intelligence (usually Neural Network) solutions.
S means structural data such as the AST is used (for AI-based models). G means a GNN is the
primary means of representing that structural data.


23.4.1.1 Model Input / Output


The input to this technique is a micro-level representation of code: it is just the
AST of a single subroutine. All nodes in the AST are nodes in the graph, whether
they are visible to the programmer or not. The only edge type is the parent-child
relationship in the AST. Consider the code and example summaries in Example 23.1
and the AST of this code in Figure 23.1. In Figure 23.1, bold indicates text from
source code that is visible to a human reader in the source code file; a
depth-first traversal that visits the leaf nodes in order reveals the code
sequence, e.g., “public void send guess ...” Non-bold indicates AST nodes that the
compiler uses to represent structure. Visible text is shown preprocessed, as it
would appear to the model. For example, the name sendGuess is split into send and
guess, and both nodes are children of a name node, which is a child of function.
Neither name nor function is visible to a human reader. The circled areas 1-4 are
reference points for discussion in Sections 23.4.1.4 and 23.4.2.
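The identifier splitting described above (sendGuess into send and guess) can be sketched with a regular expression. This is only an illustration of the idea; the actual preprocessing follows the procedures of LeClair and McMillan (2019), and the helper name split_identifier and its handling of acronyms and underscores are assumptions.

```python
import re

# Matches, in order of preference: an acronym followed by a capitalized word,
# an optionally capitalized lowercase run, a remaining uppercase run, or digits.
SUBTOKEN = re.compile(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|\d+")

def split_identifier(name):
    """Split a camelCase or snake_case identifier into lowercase subtokens."""
    parts = []
    for chunk in name.split("_"):
        parts.extend(SUBTOKEN.findall(chunk))
    return [part.lower() for part in parts]

print(split_identifier("sendGuess"))      # ['send', 'guess']
print(split_identifier("statusBarInfo"))  # ['status', 'bar', 'info']
print(split_identifier("IOException"))    # ['io', 'exception']
```

Splitting identifiers this way lets the subtokens share a vocabulary with the words of the natural language summaries.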

The AST in Figure 23.1 is the only input to the model, from which the model
must generate an English description. Technically, the AST is in srcML format
(Collard et al, 2011), preprocessed (e.g., splitting identifiers such as sendGuess
into send and guess) using community standard procedures (LeClair and McMillan,
2019). The reference output description in Example 23.1 is the actual JavaDoc
summary written by a human programmer. The summary labeled “ast-attendgru-gnn” is
the prediction from this approach. The summary labeled “ast-attendgru-flat” is the
output from an immediate predecessor that used an RNN on a linearization of the
AST. The only difference between the GNN and flat AST approach is the structure of
the encoder; all other model details are identical. Yet the GNN-based approach
matched the reference in all but the final word, while the flat AST approach
matched only a few words. Shortly we will analyze this example to provide
intuition about why the model performed so well.


summaries

reference                                 sends a guess to the server
ast-attendgru-gnn (LeClair et al, 2020)   sends a guess to the socket
ast-attendgru-flat (LeClair et al, 2019)  attempts to initiate a <UNK> guess


source code

public void sendGuess(String guess) {
    if( isConnected() ) {
        gui.statusBarInfo("Querying...", false);
        try {
            os.write( (guess + "\r\n").getBytes() );
            os.flush();
        } catch (IOException e) {
            gui.statusBarInfo("Failed to send guess.", true);
            System.err.println("IOException during send guess");
        }
    }
}


Example 23.1: The function sendGuess() and summary descriptions.


23.4.1.2 Model Architecture


The model architecture, as mentioned, is essentially a 2-hop graph2seq design based 
on a convolutional GNN. While we leave the details of the model to the relevant 
paper (LeClair et al, 2020), a bird’s-eye view of the model is in Figure 23.2. 

The model input is derived only from a single subroutine being described: the 
code as a sequence and the AST nodes and edges (Figure 23.2 area A). A word em- 
bedding projects tokens in the sequence and nodes in the AST into the same vector 
space, which is possible because the vocabulary is the same in both the sequence 
and the node input (area B). A 2-hop convolutional GNN is used to form a vec- 
torized representation of the AST (area C). The output after the second hop is a 
matrix in which each column is a vector representing a node in the AST. A GRU 
is then applied to this matrix to capture information about the order in which the 
nodes appear. Meanwhile, a GRU is also applied to the sequence directly (area D). 
The decoder is a simple GRU representation of the summary (area H). Attention is 
applied between the decoder output and the sequence GRU output, as well as the 
GNN output (area E). The attended matrices are then concatenated into a context 
matrix (area F) and connected to an output dense layer (area G). 
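The effect of two hops over the AST can be illustrated with a deliberately simplified sketch. It uses plain mean aggregation with no learned weights and one-hot node features, which is an assumption made for readability and not the convolutional layer of LeClair et al (2020). The tiny graph is the function, name, send, guess fragment of the running example.

```python
# Tiny AST fragment from the running example: function -> name -> {send, guess}.
nodes = ["function", "name", "send", "guess"]
edges = [(0, 1), (1, 2), (1, 3)]  # parent-child pairs

# One-hot features so the mixing is easy to read; a real model uses learned embeddings.
features = [[1.0 if i == j else 0.0 for j in range(len(nodes))] for i in range(len(nodes))]

def neighbors(index):
    """Undirected neighborhood of a node, plus the node itself."""
    result = {index}
    for parent, child in edges:
        if parent == index:
            result.add(child)
        if child == index:
            result.add(parent)
    return sorted(result)

def hop(vectors):
    """One hop: each node becomes the mean of itself and its neighbors."""
    out = []
    for i in range(len(vectors)):
        group = [vectors[j] for j in neighbors(i)]
        out.append([sum(column) / len(group) for column in zip(*group)])
    return out

after_one_hop = hop(features)
after_two_hops = hop(after_one_hop)
print(after_one_hop[2])   # [0.0, 0.5, 0.5, 0.0]
print(after_two_hops[2])  # [0.125, 0.375, 0.375, 0.125]
```

After one hop, the vector for send contains only itself and its parent name. After the second hop it also carries weight from function (its grandparent) and guess (its sibling), which is the sense in which a 2-hop model lets nearby AST nodes inform each other.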

A key feature of the model is the attention between the decoder and the GNN 
output. The purpose of this attention is to highlight the nodes in the AST that are the 
most related to the words in the decoder sequence. We will describe below how this 
attention was made much more effective by the shared word embedding (area B). 
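A minimal sketch of attention between one decoder state and the node vectors is given below. Dot-product scoring with a softmax is assumed here for illustration; the exact attention used in the model is described in the relevant paper. The vectors are invented, and the point of the example is that when the decoder word and an AST node share an embedding, their dot product is large and that node receives most of the weight.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(decoder_state, node_vectors):
    """Dot-product attention: weights over nodes and the weighted sum (context vector)."""
    scores = [sum(d * n for d, n in zip(decoder_state, vec)) for vec in node_vectors]
    weights = softmax(scores)
    context = [
        sum(w * vec[i] for w, vec in zip(weights, node_vectors))
        for i in range(len(decoder_state))
    ]
    return weights, context

# Hypothetical node vectors for the AST nodes "send", "guess", and "os".
node_names = ["send", "guess", "os"]
node_vectors = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

# Hypothetical decoder state for the summary word "guess", aligned with the
# "guess" node because both are assumed to come from the same embedding space.
decoder_state = [0.0, 2.0, 0.0]

weights, context = attend(decoder_state, node_vectors)
print(node_names[weights.index(max(weights))])  # guess
```

Without a shared embedding, the decoder state and node vectors would live in unrelated spaces, and the model would have to learn this alignment from scratch.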


23.4.1.3 Experiment 


An experiment demonstrated improvement of the GNN model over various baselines
and explored the effects of various model design decisions. The experiment used a
dataset of 2.1 million Java methods and associated JavaDoc summaries (LeClair et al,
2020). The dataset was split by project: 80% of the projects were assigned to the
training set, and 10% each to validation and testing. Duplicates and other defects
were removed from the dataset in accordance with community standards (LeClair and
McMillan, 2019). The model was trained on methods from the projects in the training
set. Training ran for 10 epochs, and the model with the highest validation accuracy
was selected for testing. The predictions on the test set were then compared with
the reference summaries.
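Splitting by project rather than by method keeps all methods of a project in the same partition, so code from one project cannot appear in both training and test data. A minimal sketch of such a split follows; the function name, the seed, and the toy data are assumptions for illustration and do not reproduce the original experimental scripts.

```python
import random

def split_by_project(methods, seed=0, train=0.8, validation=0.1):
    """methods: list of (project, method_id) pairs. Returns a dict of three method lists."""
    projects = sorted({project for project, _ in methods})
    random.Random(seed).shuffle(projects)
    n_train = int(len(projects) * train)
    n_val = int(len(projects) * validation)
    groups = {
        "train": set(projects[:n_train]),
        "validation": set(projects[n_train:n_train + n_val]),
        "test": set(projects[n_train + n_val:]),
    }
    return {name: [m for m in methods if m[0] in members] for name, members in groups.items()}

# Toy data: 10 hypothetical projects with 3 methods each.
methods = [(f"project{p}", f"method{m}") for p in range(10) for m in range(3)]
parts = split_by_project(methods)
print({name: len(items) for name, items in parts.items()})
# {'train': 24, 'validation': 3, 'test': 3}
```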

Three findings stand out among those reported by LeClair et al (2020). First, the
GNN-based approaches outperform the most similar baseline (ast-attendgru-flat) by
about 1 BLEU point (about a 5% improvement). Since the only difference between
the “flat” model and this GNN-based one is the AST encoder portion of the model,
the improvement can be attributed to the use of the GNN (as opposed to an RNN) for
the AST encoding. Improvement was also observed over two other baselines. The
vanilla graph2seq model, which had only the AST and not the sequence encoder
(Figure 23.2 area A), was roughly equivalent to the flat AST model in terms of
aggregate BLEU score, but this score obscures some details of the performance,
which we will see in the next section.

The second key finding is that a hop distance of two results in the best over- 
all performance. While models with GNN iterations ranging between one and ten 
all achieve higher scores than the baselines, the model performs best with two it- 
erations. One explanation is that nodes in the AST are only relevant to each other 
within a distance of about two. The AST is a tree, so information is propagated up 
and down levels of the tree. For two hops, this means information from a node will 
propagate to its parent in the first hop and then to its grandparent and siblings in 
the second hop. It is possible that nodes beyond this scope are not that relevant to 
the model for code summarization. However, another explanation is that the method 
of aggregating information in each hop is less efficient after two hops — this inter- 
pretation would be consistent with findings by Xu et al (2018c) that aggregation 
procedure is critical to GNN deployment. Either way, the practical advice for model 
designers is that the optimal number of GNN iterations for this task is not that high. 

The third key finding is that the use of the GRU after the GNN layer (Figure 23.2 
after area C) improves overall performance. The models labeled with the suffix 
+GRU use this GRU layer, as described in Section 23.4.1.2. The model labeled with 
the suffix +dense calculates attention between the decoder and the output matrix 
from the GNN. This model did not perform as well. A likely explanation is that 
source code has not only a tree structure via the AST — it also has an order from 
start to end. The GRU after the GNN captures this order and seems to result in a 
better representation of the code for summarization. 


23.4.1.4 What benefit did the GNN bring? 


A question remains regarding what benefit can be attributed to the use of a GNN. 
While we and others may observe an improvement in overall BLEU scores when 
using a GNN (LeClair et al, 2020; Zügner et al, 2021; Liu et al, 2021), a key point
is that the GNN contributes orthogonal information to the model. This section ex- 
plores how. 


Concentration of Improvement: 

The improvement is concentrated among a set of subroutines where the GNN 
adds significant improvement. It is not the case that the BLEU scores increase 
marginally for all subroutines — there is a set of subroutines that benefits the most. 
Consider Figure 23.3. The pie chart divides the subroutines in the test set of the
experiment described above into five groups: one group where ast-attendgru-gnn
performed the best, one group where ast-attendgru-flat performed the best, one group
where they tied, one group for attendgru, and one group for other ties including
when all models made the same prediction. For simplicity, we use BLEU-1 scores
(BLEU-1 is unigram precision, single words predicted correctly).
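
As a rough illustration of unigram precision, the sketch below counts how many predicted words also appear in the reference, clipping repeated words by their count in the reference. It is a simplification under stated assumptions: whitespace tokenization, a single reference, and no brevity penalty, so it is not the full BLEU metric used in the experiments. The second prediction is a made-up example.

```python
from collections import Counter

def unigram_precision(prediction, reference):
    """Fraction of predicted words that also occur in the reference (clipped counts)."""
    pred_tokens = prediction.split()
    if not pred_tokens:
        return 0.0
    ref_counts = Counter(reference.split())
    pred_counts = Counter(pred_tokens)
    matched = sum(min(count, ref_counts[token]) for token, count in pred_counts.items())
    return matched / len(pred_tokens)

reference = "sends a guess to the socket"
exact = unigram_precision("sends a guess to the socket", reference)
partial = unigram_precision("attempts to send a guess", reference)
print(exact, partial)
```

The second prediction shares only "to", "a", and "guess" with the reference, so three of its five words count as correct.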

What we observe is that each model achieves the highest BLEU-1 score for 20-25%
of the subroutines. For about 12% of the subroutines, the AST-based models
were tied, meaning that in total over 50% of the subroutines benefited from AST
information (GNN plus flat AST models). But there still exists a large set of sub- 
routines where attendgru outperformed all others. However, consider the bar chart 
in Figure 23.3. The “all” columns show the BLEU-1 score for that approach — note 
that ast-attendgru-gnn is only marginally higher than others. The “best” columns 
show the score for the set where that model achieved the highest BLEU-1 score (the 
set with that model’s name indicated in the pie chart). We observe that the BLEU-1 
scores for ast-attendgru-gnn are much higher for this set than others. 


Demonstrating Improvement in Example 23.1: 

A deeper dive into the subroutine sendGuess() from Example 23.1 demonstrates
the improvement that a GNN provides. Recall that the ast-attendgru-gnn
model calculates attention between each position in the decoder and each node in
the output from the GNN (Section 23.4.1.2, Figure 23.2 area E). The result is an
m × n matrix where m is the length of the decoder sequence and n is the number of
nodes (in the implementation, m=13 and n=100). Thus each position in the attention
matrix represents the relevance of an AST node to a word in the output summary.
In fact, the attention matrix for ast-attendgru-flat has the same meaning: the mod- 
els are identical except that ast-attendgru-gnn encodes the AST with a GNN then a 
GRU, while the flat model uses only the GRU. Comparing the values in these atten- 
tion matrices provides a useful contrast of the two models because they show the 
contribution of the AST encoding to the prediction. 
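
The shape and meaning of such an attention matrix can be sketched as follows. This is only an illustration: it assumes dot-product scores normalized with a softmax over the encoder positions, and it uses random toy vectors of an arbitrary dimension (`dim`) in place of the learned decoder and encoder states.

```python
import math
import random

def attention_matrix(decoder_states, encoder_states):
    """Row i holds the softmax-normalized dot products between decoder
    position i and every encoder position (AST node)."""
    matrix = []
    for d in decoder_states:
        scores = [sum(a * b for a, b in zip(d, e)) for e in encoder_states]
        top = max(scores)  # subtract the maximum for numerical stability
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        matrix.append([x / total for x in exps])
    return matrix

random.seed(0)
m, n, dim = 13, 100, 8  # m and n follow the text; dim is an arbitrary toy size
decoder = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(m)]
nodes = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n)]
attn = attention_matrix(decoder, nodes)
print(len(attn), len(attn[0]))
```

Each row sums to one, so an entry can be read as the share of attention that one output position pays to one AST node.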

The benefit of a GNN becomes apparent in the attention networks in Figure 23.4.
Both models have a very similar attention activation to the tokens in the source code
sequence (Figures 23.4a and 23.4c). Both models show close attention to position
2 of the code sequence, which is the word “send”. This is not surprising considering
that “send” appears in the method’s name. Yet, ast-attendgru-flat still incorrectly
predicts the first word of the summary as “attempts”, while ast-attendgru-gnn
correctly predicts “sends.” The explanation lies in the attention to AST nodes. The
flat model focuses on node 37 (Figure 23.4d), which is an expr_stmt node immediately
after the try block, just before the call to os.write(), indicated as area
1 in Figure 23.1. The reason for this focus suggested by the original paper on that
model (LeClair et al, 2019) is that the flat AST model tends to learn broadly similar
code structure such as “if-block, try-block, call to os.write().” Under this
explanation, methods in the training set with this if-try-call-catch pattern are associated
with the word “attempts.”

In contrast, the GNN-based model focuses on position 8, which is the word
“send” in the method name, just like in the attention to the code sequence
(Figure 23.4b). The result is that the GNN-based AST encoding reinforces the attention
paid to this word when predicting the first word of the output. Consider the method’s 
AST in Figure 23.1. Position 8 is the node for “send” indicated at area 2. In a 2-hop 
GNN, this node will share information with its parent (name), grandparent (func- 
tion), and sibling (guess). During training, the model learned that words associated 
with the AST nodes “function” and “name” are likely candidates for the first word 
of the summary, so the model knows to highlight this word. 

In short, the GNN model outperformed because it conveys a lopsided benefit to
a particular subset of the subroutines, and a likely reason it conveys this benefit is 
that it learns to associate AST tokens with particular locations in the code summary. 


23.4.2 Directions for Improvement 


The view of software as a graph described in Section 23.2 provides two directions 
for improvement: micro- and macro-level representations. Essentially the choice is 
whether to attempt to squeeze more information out of the source code being de- 
scribed (micro-level) or to draw upon more information from outside that source 
code (macro-level). If the aim is to generate summaries of a Java method, then one
may learn more information about the details of that method, or one may use
information from the classes, packages, dependencies, etc., around the method.
Micro- and macro-level improvements tend to be complementary rather than
competitive. Learning more about the macro-level graph information benefits models of
micro-level information and vice versa (Haque et al, 2020).


23.4.2.1 Example Micro-level Improvement 


Liu et al (2021) present a notable example of an improvement to GNN-based code
summarization using a richer micro-level graph representation of software. The
essentials of the approach are similar to those of LeClair et al (2020) described above: the
input to the model is the source code of a subroutine, and the output is a description
of the subroutine. The encoder is based on a GNN, and the input to this GNN is the
AST of the subroutine. The nodes in the graph are AST nodes, and the edges are the
AST parent-child relationships. However, one novel aspect is that the model also
considers other types of edges, namely control flow and data dependencies (these
are unified as a Code Property Graph (Yamaguchi et al, 2014)). The benefit of this
structure is that nodes in the AST will receive information directly from other
relevant parts of the code, rather than only the nodes nearby in the AST.

Consider Figure 23.1 area 3, which is an AST node corresponding to the string 
variable “guess” in Example 23.1. The ast-attendgru-gnn approach would propagate
information from that variable to the parents, grandparents, and siblings (in the
two-hop configuration). These would be the “name” and “decl” AST nodes. These
nodes have locations in the word embedding associated with them, and these nodes 
also appear in practically every subroutine in the dataset. So, the model will learn 
how these nodes are used and associate them with what a human would call a vari- 
able declaration. The effect in this example is that the model will learn that the word 
“guess” is a variable name declaration. 

The approach by Liu et al. improves over ast-attendgru-gnn because it can learn 
this relationship in addition to several others. The experiment with ast-attendgru-gnn 
showed evidence that AST structural information can lead to a better representation 
of code: it is useful to know that “guess” is a variable name declaration. But other
relationships also exist. The variable “guess” is used in the call to os.write(). This
relationship is a data dependency and is useful to human readers (Freeman, 2003).
A human attempting to comprehend this code would likely note that whatever is 
passed into the subroutine as a parameter via the variable “guess” is subsequently 
written out via a method call. The benefit to Liu et al.’s approach is that it captures 
this relationship and uses it to form a more-complete GNN-based representation of 
the code. 

A caveat is that as more edge types are added to the graph, more information 
will be propagated among nodes, which may have effects that are difficult to ex- 
plain. Imagine in Figure 23.1 if an edge were to exist between “guess” at area 4 and 
“guess” at area 1, denoting a data dependency. A typical GNN design would prop- 
agate information across this edge. The result would be that the nodes around the 
location that uses “guess” would gain information from the nodes where “guess” is 
defined. But now imagine a control dependency from the try block start to the call
to os.write(). The information would then also propagate from the try block to 
the use of “guess” over the control flow edge and then from the use of “guess” to the 
definition of “guess” over the data flow edge. This connection is difficult to explain 
— it is not clear what it means for a try block to be connected to the parameter list. 
A human may proffer an explanation for this particular subroutine, but a model such 
as ast-attendgru-gnn would always propagate information across these edges, even 
when it does not make sense to do so. 

Liu et al. solve this problem by using an attentional GNN proposed by Zhu et al 
(2019b). Essentially, this GNN adds an attention layer as a gate prior to propagating 
information across an edge. The input to this gate includes the node embedding for 
the node at the origin of the edge, plus an edge embedding for that type of edge. The 
result is that the model learns during training when to propagate information from a 
node over a particular type of edge. That way, information from, e.g., the try block
may or may not propagate to the parameter list, depending on whether that particu- 
lar connection was useful during training. Liu et al. use the learned representation of 
code to help locate similar code comments in a database of those comments. How- 
ever, the big picture idea is to use an attentional GNN to emphasize some edges in 
the code over others when the graph representation of code becomes large and com- 
plex, and this idea may serve as inspiration for a variety of software mining tasks. 
It is an example of how better micro-level representations of code can assist these 
software mining tasks. 
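
The gating idea can be sketched in a few lines. This is not the architecture of Zhu et al (2019b) or Liu et al (2021); it is a minimal stand-in that assumes a single scalar sigmoid gate computed from the source node embedding concatenated with an edge-type embedding. All vectors and weights below are hand-picked toy values, whereas in a real model they would be learned during training.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_message(source_embedding, edge_embedding, gate_weights):
    """Scale the source node's embedding by a gate computed from the
    source node embedding concatenated with the edge-type embedding."""
    gate_input = source_embedding + edge_embedding  # list concatenation
    gate = sigmoid(sum(w * x for w, x in zip(gate_weights, gate_input)))
    return gate, [gate * value for value in source_embedding]

# Hypothetical edge-type embeddings and gate weights (toy values).
EDGE_EMBEDDINGS = {
    "ast": [1.0, 0.0],
    "data": [0.0, 1.0],
    "control": [-1.0, -1.0],
}
GATE_WEIGHTS = [0.1, 0.1, 2.0, 2.0]
try_block = [0.5, -0.5]  # toy embedding of the node at the origin of the edge

GATES = {}
for edge_type, edge_embedding in EDGE_EMBEDDINGS.items():
    gate, message = gated_message(try_block, edge_embedding, GATE_WEIGHTS)
    GATES[edge_type] = gate
    print(edge_type, round(gate, 3), [round(v, 3) for v in message])
```

With these toy values the gate stays mostly open for the "ast" and "data" edge types and is nearly closed for "control", so almost nothing from the try block would flow across a control edge.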


23.4.2.2 Example Macro-level Improvement 


One inspiration for macro-level improvement to neural code summarization comes from
Aghamohammadi et al (2020). Their approach focuses on generating summaries of
code in Android projects. The approach is divided into two parts. The first part cen- 
ters around an attentional encoder-decoder model similar to the attendgru baseline 
described by LeClair et al (2019). They use this model to generate an initial code 
summary based solely on the words inside the subroutine itself. The second part is to 
augment the initial summary with phrases from the summaries of other subroutines 
in the same project. The approach is to obtain a dynamic call graph of the Android
program, which represents the actual runtime control flow from one subroutine to
the next. Then a subset of the subroutines in this call graph is selected using
PageRank; the idea is to emphasize the subroutines that are called many times or hold
other importance measurable from the structure of the call graph (McMillan et al,
2011). The summaries from these subroutines are then appended to the initial
summary.
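
The selection step can be illustrated with a plain power-iteration PageRank over a toy call graph. The graph, the subroutine names other than sendGuess, the damping factor, and the choice of keeping the top two subroutines are all assumptions for illustration; the settings actually used by Aghamohammadi et al (2020) are not given here.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Power-iteration PageRank. `graph` maps each caller to the list of
    subroutines it calls; every subroutine must appear as a key."""
    nodes = list(graph)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by subroutines that call nothing is spread evenly.
        dangling = sum(rank[node] for node in nodes if not graph[node])
        new_rank = {node: (1.0 - damping) / n + damping * dangling / n for node in nodes}
        for node in nodes:
            for callee in graph[node]:
                new_rank[callee] += damping * rank[node] / len(graph[node])
        rank = new_rank
    return rank

# Hypothetical call graph: caller -> callees.
CALL_GRAPH = {
    "onClick": ["sendGuess", "updateView"],
    "onKeyPress": ["sendGuess"],
    "sendGuess": ["write"],
    "updateView": [],
    "write": [],
}

ranks = pagerank(CALL_GRAPH)
top_two = sorted(ranks, key=ranks.get, reverse=True)[:2]
print(top_two)
```

Subroutines reached from many callers accumulate rank, so in this toy graph the two handlers rank lowest and the subroutines they call rank highest.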

The approach of Aghamohammadi et al (2020) demonstrates an advantage of
macro-level information. The macro-level information is the dynamic call graph of the
entire program, and it is used to augment summaries created from the source code 
itself. The summaries tend to be longer and to provide more contextual informa- 
tion to readers. Recall sendGuess() in Example 23.1, for which ast-attendgru-gnn 
wrote “sends a guess to the socket.” The approach by Aghamohammadi et al (2020) 
may (hypothetically) find that the subroutine that calls sendGuess() is a mouse 
click handler subroutine, and so would append, e.g., “called when the mouse is used 
to click the button.” Human readers of documentation benefit from knowing how 
subroutines are used, so summaries that include this macro-level information tend 
to be considered more valuable by those readers (Holmes and Murphy, 2005; Ko 
et al, 2006; McBurney and McMillan, 2016). 

Macro-level representations of code for software mining tasks are likely fertile 
ground for GNN-based technologies. The dynamic call graphs which Aghamoham- 
madi et al (2020) extract contain information from actual runtime use, and a GNN 
may serve as a useful tool in generating a representation of this information. Yet, 
applications of GNNs to macro-level data for software mining tasks are still in their 
infancy. 


23.5 Summary 


In this chapter, we presented Software Mining Tasks as an application area for 
GNNs. A high-level view of any approach is to represent the software as a graph, 
then create a GNN model able to use this graph to learn to make predictions for a 
particular purpose. We present two views of software graphs: a micro- and macro- 
level representation. Micro-level representations predominate. For example, for the 
task of bug prediction in a subroutine, most approaches tend to look exclusively 
within those subroutines for patterns associated with that bug. Yet, evidence is 
emerging that macro-level representations may also benefit these tasks, as the con- 
text surrounding code is very likely to contain information necessary to compre- 
hend that code. The future likely lies in combined GNN models of both micro- and 
macro-level graph representations of software. 

We focus in this chapter on the task of source code summarization as an exam- 
ple of how GNN-based models help produce better predictions for software mining 
tasks. A straightforward approach is described in which the AST of subroutines is 
used to train a GNN, which leads to a better micro-level representation in many 
cases. An improvement based on an attentional GNN shows how much more
complex graphs can also be exploited for this purpose. Yet, these improvements
for code summarization likely herald improvements for many software mining
tasks. Both code representational and code generational tasks depend heavily on
understanding the nuances of the structure of that code, and GNNs are a likely avenue
for capturing this structure. This chapter has covered the history of this research, a
specific target problem, and recommendations for future researchers.


Editor’s Notes: AI for Code has been a very fast-growing area in recent years.
A programming language can be viewed as a second language alongside
human language, so it is not surprising that the two share many attributes.
Accordingly, both the NLP and software communities have begun paying
considerable attention to applying GNNs in their domains, and both have
achieved great success. Just as with GNNs for NLP, graph structure
learning techniques in Chapter 14, GNN Methods in Chapter 4, GNN Scalability
in Chapter 6, Heterogeneous GNNs in Chapter 16, and GNN Robustness
in Chapter 8 are all highly important building blocks for developing an
effective and efficient approach with GNNs for code.

Figure 23.1: Abstract Syntax Tree for the function sendGuess() in Example 23.1.

Figure 23.2: High-level diagram of the model architecture for 2-hop model.

Figure 23.3: (left) Comparison of the BLEU-1 score for the subroutines where each
method performed best, to BLEU-1 score for the whole test set. (right) Percent of
test set for which each approach received the highest BLEU-1 score.




(a) ast-attendgru-gnn attention to source code sequence

(b) ast-attendgru-gnn attention to AST nodes

(c) ast-attendgru-flat attention to source code sequence

(d) ast-attendgru-flat attention to AST nodes

Figure 23.4: Visualization of attention network for ast-attendgru-gnn and
ast-attendgru-flat for the subroutine sendGuess() in Example 23.1 and AST in
Figure 23.1. Matrices are 13x100 because attention is applied between every position
in the decoder output (length 13) and every position in the encoder (100 nodes or
100 code tokens). Bright areas indicate high attention. For example, position 2 in
the code sequence is heavily emphasized for both models. Position 2 corresponds to
the word “send” in the code sequence.


GNN-based Biomedical Knowledge Graph
Mining in Drug Development


Chang Su, Yu Hou, Fei Wang 


Abstract Drug discovery and development (D³) is an extremely expensive and
time-consuming process. It takes tens of years and billions of dollars to bring a drug
to market from scratch, which makes this process highly inefficient
when facing emergencies such as COVID-19. At the same time, a huge amount
of knowledge and experience has been accumulated from the D³ process over
the past decades. This knowledge is usually encoded in guidelines or biomedical
literature, which provide an important resource containing insights that can
inform the future D³ process. A knowledge graph (KG) is an effective way
of organizing the useful information in that literature so that it can be retrieved
efficiently. It also bridges the heterogeneous biomedical concepts that are involved
in the D³ process. In this chapter we will review the existing biomedical KGs and
introduce how GNN techniques can facilitate the D³ process on the KG. We will
also introduce two case studies on Parkinson’s disease and COVID-19, and point
out future directions.


24.1 Introduction 


Biomedicine is a discipline with a great deal of highly specialized knowledge accumulated
from biological experiments and clinical practice. This knowledge is usually buried
in massive biomedical literature and textbooks, which makes effective knowledge
organization and efficient knowledge retrieval a challenging task. The knowledge graph
is a recently emerged concept aimed at achieving this goal. A knowledge graph
(KG) stores and represents knowledge by constructing a semantic network describing
entities and the relationships between them. The basic elements comprising a
knowledge graph are a set of (head, relation, tail) tuples, where the heads and tails
are concept entities and relations link these entities with semantic relationships. In
biomedicine, the typical entities could be diseases, drugs, genes, etc., and the
relationships could be treats, binds, interactions, etc. Large-scale biomedical KGs make
efficient knowledge retrieval and inference possible.
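
A minimal sketch of this representation is shown below. The entity names are placeholders rather than real biomedical facts, and the index layout is just one simple choice for fast lookup, not the storage scheme of any particular BKG.

```python
from collections import defaultdict

# Placeholder (head, relation, tail) tuples; not real biomedical facts.
TRIPLES = [
    ("drug_A", "treats", "disease_Y"),
    ("drug_A", "binds", "gene_X"),
    ("drug_B", "binds", "gene_X"),
    ("gene_X", "associated_with", "disease_Y"),
]

def build_index(triples):
    """Index tails by (head, relation) so that lookups avoid scanning all tuples."""
    index = defaultdict(set)
    for head, relation, tail in triples:
        index[(head, relation)].add(tail)
    return index

def tails(index, head, relation):
    """Return all tails linked to `head` by `relation`, in sorted order."""
    return sorted(index.get((head, relation), set()))

index = build_index(TRIPLES)
print(tails(index, "drug_A", "treats"))
print(tails(index, "drug_B", "treats"))
```

The second query returns nothing because no such tuple is stored; suggesting that drug_B might treat disease_Y (it binds the same gene as drug_A) is the kind of hypothesis that inference over the graph aims to produce.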

Biomedical KGs can effectively complement biomedical data analytics processes.
In particular, many different types of biomedical data are heterogeneous and
noisy (Wang et al, 2019f; Wang and Preininger, 2019; Zhu et al, 2019e), which
makes data-driven models developed on these data unreliable in real practice.
Biomedical KGs (BKGs) effectively encode the biomedical entities and their
semantic relationships, which can serve as “prior knowledge” to guide the
downstream data-driven analytics procedure and improve the quality of the model. On
the other hand, we can also use BKGs to generate hypotheses (such as which drug
can be used to treat which disease), and validate them in real-world health data
(such as electronic health records).

In this chapter, we will review existing BKGs and present examples of how BKGs 
can be used for generating drug repurposing hypotheses, and point out future direc- 
tions. 


24.2 Existing Biomedical Knowledge Graphs 


This section surveys the existing BKGs that are publicly available and the ways of 
BKG construction and curation (Table 24.3). 

A common way of constructing a BKG is to extract and integrate data from
data resources, which are usually manually curated to summarize and organize the
biomedical knowledge derived from biological experiments, clinical trials,
genome-wide association analyses, clinical practices, etc. (Santos et al, 2020; Ioannidis et al,
2020; Himmelstein et al, 2017; Rizvi et al, 2019; Yu et al, 2019b; Zhu et al, 2020b;
Zeng et al, 2020b,b; Domingo-Fernandez et al, 2020; Wang et al, 2020e; Percha and
Altman, 2018; Li et al, 2020d,b; Goodwin and Harabagiu, 2013; Rotmensch et al,
2017; Sun et al, 2020a). In Table 24.2, we summarize some public data resources
that have been commonly used in the construction of BKGs. For instance, the
Comparative Toxicogenomics Database (CTD) (Davis et al, 2019) is an open resource
providing rich, manually curated chemical-gene, chemical-disease, and gene-disease
relational data, with the aim of advancing understanding of the impacts of environmental
exposures on human health. DrugBank (Wishart et al, 2018) is a database containing
information on approved drugs and drugs under trial, as well as pharmacogenomic
data (e.g., drug-target interactions). Ontology resources like Gene Ontology
(Ashburner et al, 2000) and Disease Ontology (Schriml et al, 2019) store the
functional and semantic context of genes and diseases, respectively. By integrating data
from these rich resources, a number of BKGs have been constructed (Santos et al,
2020; Ioannidis et al, 2020; Himmelstein et al, 2017; Rizvi et al, 2019; Yu et al,
2019b; Zhu et al, 2020b; Zeng et al, 2020b,b; Domingo-Fernandez et al, 2020; Wang
et al, 2020e). For example, Hetionet (Himmelstein et al, 2017), released in 2017, is
a well-curated BKG that integrates 29 publicly available biomedical databases. It
contains 47,031 biomedical entities of 11 types and over 2 million relations of
24 types among those entities. Similar to Hetionet, the Drug Repurposing Knowledge
Graph (DRKG) (Ioannidis et al, 2020) was built by integrating data from six different
existing biomedical databases, and contains about 100K entities of 13 types and
over 5 million relationships of 107 types. Zhu et al (2020b) constructed a
drug-centric BKG by systematically integrating multiple drug databases such as
DrugBank (Wishart et al, 2018) and PharmGKB (Whirl-Carrillo et al, 2012). Hetionet,
DRKG, and other BKGs have been used to accelerate computational drug repurposing.
PreMedKB (Yu et al, 2019b) includes information on diseases, genes, variants,
and drugs by integrating relational data among them from existing resources.
By integrating multiple dietary-related databases, Rizvi et al (2019) built a BKG,
named Dietary Supplements Knowledge Base (iDISK), which covers knowledge of
dietary supplements, including vitamins, herbs, minerals, etc. The Clinical
Knowledge Graph (CKG) (Santos et al, 2020) was constructed by integrating relevant
existing biomedical databases such as DrugBank (Wishart et al, 2018), Disease Ontology
(Schriml et al, 2019), SIDER (Kuhn et al, 2016), etc., together with knowledge extracted from
scientific literature. It contains over 16 million nodes and over 220 million
relationships. Compared to other BKGs, CKG has a finer granularity of knowledge as it
involves more entity types such as metabolite, modified protein, molecular function,
transcript, genetic variant, food, clinical variable, etc.

With the rapid development of biomedical research, a continuously increasing
volume of biomedical articles is published every day. Manually extracting
knowledge from literature for BKG curation is no longer sufficient to meet
current needs. To this end, efforts have been made to use text mining methods to
extract biomedical knowledge from scientific literature to construct BKGs
(Domingo-Fernandez et al, 2020; Wang et al, 2020e; Percha and Altman, 2018; Li et al,
2020d). For example, Sun et al (2020a) constructed a knowledge graph by extracting
biomedical entities and relationships from drug descriptions, medical dictionaries,
and literature to identify suspected cases of Fraud, Waste, and Abuse from claim
files. COVID-KG (Wang et al, 2020e) and COVID-19 Knowledge Graph
(Domingo-Fernandez et al, 2020) were built by extracting COVID-19 specific knowledge from
biomedical literature. The resulting COVID-19 specific BKGs contain entities such
as diseases, chemicals, genes, and pathways, along with their relationships. KGHC
(Li et al, 2020d) is a BKG with a specific focus on hepatocellular carcinoma.
It was built by extracting knowledge from literature and contents on the internet,
as well as structured triples from SemMedDB (Kilicoglu et al, 2012). In addition,
some studies (Goodwin and Harabagiu, 2013; Li et al, 2020b; Rotmensch et al,
2017; Sun et al, 2020a) tried to build BKGs from clinical data such as electronic
health records (EHRs) and electronic medical records (EMRs). For instance,
Rotmensch et al (2017) constructed a BKG by extracting disease-symptom associations
from EHR data using a data-driven approach. Li et al (2020b) proposed a
systematic pipeline for extracting a BKG from large-scale EMR data. Compared to other
BKGs based on a triplet structure, the resulting KG is based on a quadruplet structure,
i.e., (head, relation, tail, property). Here the property includes information such as
co-occurrence number, co-occurrence probability, specificity, and reliability of the
corresponding (head, relation, tail) triplet.
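
The quadruplet structure can be sketched as a triplet plus a dictionary of properties. The sketch below is not the pipeline of Li et al (2020b): the records are hypothetical, only two of the listed properties are computed, and the co-occurrence probability is assumed here to be the pair count divided by the number of records mentioning the head, which may differ from the original definition.

```python
from collections import Counter
from dataclasses import dataclass, field
from itertools import product

@dataclass
class Quadruplet:
    head: str
    relation: str
    tail: str
    properties: dict = field(default_factory=dict)

# Hypothetical records: (diseases, symptoms) noted together in one visit.
RECORDS = [
    ({"disease_A"}, {"symptom_1", "symptom_2"}),
    ({"disease_A"}, {"symptom_1"}),
    ({"disease_B"}, {"symptom_2"}),
]

def build_quadruplets(records, relation="has_symptom"):
    """Count disease-symptom co-occurrences and attach them as properties."""
    pair_counts = Counter()
    head_counts = Counter()
    for diseases, symptoms in records:
        head_counts.update(diseases)
        pair_counts.update(product(diseases, symptoms))
    return [
        Quadruplet(head, relation, tail, {
            "co_occurrence_number": count,
            # Assumed definition: share of the head's records containing the tail.
            "co_occurrence_probability": count / head_counts[head],
        })
        for (head, tail), count in sorted(pair_counts.items())
    ]

quads = build_quadruplets(RECORDS)
for quad in quads:
    print(quad)
```

Keeping the statistics next to each triplet lets a downstream step filter or weight edges, for example by dropping associations that were observed only once.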

| BKGs | Entities | Relations | Focus | Construction method | URL |
|---|---|---|---|---|---|
| Clinical Knowledge Graph (Santos et al, 2020) | 16 million entities from 33 entity types | 220 million relations from 51 relation types | General | Resources Integration | https://github.com/MannLabs/CKG |
| Drug Repurposing Knowledge Graph (Ioannidis et al, 2020) | 97,238 entities from 13 entity types | 5,874,261 relations from 107 relation types | General | Resources Integration | https://github.com/gnn4dr/DRKG |
| Hetionet (Himmelstein et al, 2017) | 47,031 entities from 11 entity types | 2,250,197 relations from 24 relation types | General | Resources Integration | https://het.io/ |
| iDISK (Rizvi et al, 2019) | 144,059 entities from 6 entity types | 708,164 relations from 6 relation types | Dietary Supplements | Resources Integration | https://conservancy.umn.edu/handle/11299/204783 |
| PreMedKB (Yu et al, 2019b) | 404,904 entities from 4 entity types | 496,689 relations from 52 relation types | General | Resources Integration | https://www.fudan-pgx.org/premedkb/index.html#/home |
| Zhu et al (2020b) | 5 entity types | 9 relation types | General | Resources Integration | - |
| Zeng et al (2020b) | 145,179 entities from 4 entity types | 15,018,067 relations from 39 relation types | General | Resources Integration | - |
| COVID-19 Knowledge Graph (Domingo-Fernandez et al, 2020) | 3,954 entities from 10 entity types | 9,484 relations | COVID-19 | Literature Mining | https://github.com/covid19kg/covid19kg |
| COVID-KG (Wang et al, 2020e) | 67,217 entities from 3 entity types | 85,126,762 relations from 3 relation types | COVID-19 | Literature Mining | http://blender.cs.illinois.edu/covid19/ |
| Global Network of Biomedical Relationships (Percha and Altman, 2018) | Three entity types (Chemical, Disease, Gene) | 2,236,307 relations from 36 relation types | General | Literature Mining | https://zenodo.org/record/1035500 |
| KGHC (Li et al, 2020d) | 5,028 entities from 9 entity types | 13,296 relations | Hepatocellular Carcinoma | Literature Mining | https//202.1178)..75:.18:18895/browser/ |
| Li et al (2020b) | 22,508 entities from 9 entity types | 579,094 relations | General | EHR Mining | - |
| QMKG (Goodwin and Harabagiu, 2013) | 634,000 entities | 1,390,000,000 relations | General | EHR Mining | - |
| Rotmensch et al (2017) | 647 entities from 2 entity types | Disease-Symptom | General | EHR Mining | - |
| Sun et al (2020a) | 1,616,549 entities from 62 entity types | 5,963,444 relations from 202 relation types | General | EHR Mining | https://web.archive.org/web/20191231152615if_/http://121.12.85.245:1347/kg_test/#/ |



Table 24.2: Publicly available resources for BKG construction 


| Database | Entities | Relations | Short Description | URL |
|---|---|---|---|---|
| Bgee (Bastian et al, 2021) | 60,072 Anatomy and Gene entities | 11,731,369 relations in terms of presence/absence of expression | A database for Anatomy-Gene Expression | https://bgee.org/ |
| Comparative Toxicogenomics Database (Davis et al, 2019) | 73,922 Disease, Gene, Chemical, Pathway entities | 38,344,568 Chemical-Gene, Chemical-Disease, Chemical-Pathway, Gene-Disease, Gene-Pathway, and Disease-Pathway relations | A manually curated database that includes chemical-disease-gene-pathway relations | http://ctdbase.org/ |
| Drug-Gene Interaction Database (Cotto et al, 2018) | 160,054 Drug and Gene entities | 96,924 Drug-Gene Interaction relations | A database for drug-gene interactions | https://www.dgidb.org/ |
| DISEASES (Pletscher-Frankild et al, 2015) | 22,216 Disease and Gene entities | 543,405 relations | A database for Disease-Gene Association | https://diseases.jensenlab.org/ |
| DisGeNET (Piñero et al, 2020) | 159,052 Disease, Gene and Variant entities | 839,138 Gene-Disease, Variant-Disease relations | A database that integrates data from expert-curated repositories for genes and variants associated with human diseases | https://www.disgenet.org/home/ |
| IntAct (Orchard et al, 2014) | 119,281 Chemical and Gene entities | 1,130,596 relations | A database for molecular interaction data | https://www.ebi.ac.uk/intact/ |
| STRING (Szklarczyk et al, 2019) | 24,584,628 Protein entities | 3,123,056,667 Protein-Protein Interaction relations | A database for Protein-Protein Interaction networks | https://string-db.org/ |
| SIDER (Kuhn et al, 2016) | 7,298 Drug and Side-effect entities | 139,756 Drug-Side effect relations | A database containing medicines and their recorded adverse drug reactions | http://sideeffects.embl.de/ |
| SIGNOR (Licata et al, 2020) | 7,095 entities from 10 entity types | 26,523 relations | A database for signaling information published in the scientific literature | https://signor.uniroma2.it/ |
| TISSUE (Palasca et al, 2018) | 26,260 entities in Tissue and Gene | 6,788,697 relations | A database for Tissue-Gene Expression, manually curated from the literature | https://tissues.jensenlab.org/ |
| DrugBank (Wishart et al, 2018) | 15,128 Drug entities | 28,014 Drug-Target, Drug-Enzyme, Drug-Carrier, Drug-Transporter relations | A database for information on drugs and drug targets | https://go.drugbank.com/ |
| KEGG (Kanehisa and Goto, 2000) | 33,756,186 entities in Drug, Pathway, Gene, etc. | - | A database for genomes, biological pathways, diseases, drugs, and chemical substances | https://www.kegg.jp/kegg/ |
| PharmGKB (Whirl-Carrillo et al, 2012) | 43,112 entities in Genes, Variant, Drug/Chemical and Phenotype | 61,616 relations | A database for drugs and drug-related relationships | https://www.pharmgkb.org/ |
| Reactome (Jassal et al, 2020) | 21,087 Pathway entities | - | A manually curated database for peer-reviewed pathway | https://reactome.org/ |
| Semantic MEDLINE Database (Kilicoglu et al, 2012) | - | 109,966,978 relations | A database containing semantic predictions from the literature | https://skr3.nlm.nih.gov/index.html |
| Gene Ontology (Ashburner et al, 2000) | 44,085 Gene entities | - | An ontology of the functions of genes | http://geneontology.org/ |
24.3 Inference on Knowledge Graphs 


In KG inference, one usually needs to address two important attributes of KGs: 1)
the KG's local and global structure properties, and 2) the heterogeneity of entities and
relations (Wang et al, 2017d; Cai et al, 2018b; Zhang et al, 2018c; Goyal and Ferrara,
2018; Su et al, 2020c; Zhao et al, 2019d). In this context, a standard pipeline for KG
inference typically contains two major steps: 1) learning embeddings (i.e., representation
vectors) for entities (and relations) while preserving their structural properties
and the entity and relation attributes in the KG; and 2) performing downstream
tasks such as entity classification and link prediction using the learned embeddings.
Of note, one can perform these two steps separately or build an end-to-end
model that jointly learns the embeddings and performs the downstream tasks. In this
section, we review the existing techniques for inference on KGs, including the conventional
inference techniques and the GNN-based models.


24.3.1 Conventional KG inference techniques 


This subsection surveys the conventional KG inference techniques. 

Semantic matching models typically exploit similarity-based energy functions
by matching the latent semantics of entities and relations in the embedding space.
A well-known semantic matching model, RESCAL (Nickel et al, 2011; Jenatton
et al, 2012), was proposed based on the idea that entities are similar if they are connected
to similar entities via similar relations (Nickel and Tresp, 2013). By associating each
relation $r_k$ with a matrix $M_k$, it defines the energy function by a bilinear model
$f(e_i, r_k, e_j) = h_i^\top M_k h_j$, where $h_i, h_j \in \mathbb{R}^d$ are the d-dimensional
embedding vectors of entities $e_i$ and $e_j$, respectively. RESCAL jointly learns the
embeddings $h_i$ and $h_j$ of the entities and the matrix $M_k$ of the relation. Another model,
DistMult (Yang et al, 2015a), simplifies RESCAL by restricting the matrix $M_k$ of relation
$r_k$ to be diagonal. Though DistMult is more efficient than RESCAL, its energy function
is symmetric in $e_i$ and $e_j$, so it can only deal with undirected graphs. To address this,
HolE (Nickel et al, 2016b) composes $e_i$ and $e_j$ by their circular correlation.
Consequently, HolE inherits the expressive power of RESCAL and the efficiency of
DistMult. Other semantic matching models adopt neural network architectures that
take the embeddings as the input layer and the energy function as the output layer,
such as the semantic matching energy (SME) model (Bordes et al, 2014) and the
multi-layer perceptron (MLP) (Dong et al, 2014).
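
The difference between the full bilinear form of RESCAL and the diagonal restriction of DistMult can be made concrete with a small sketch. The function names and the toy embedding values below are illustrative assumptions, not taken from the cited papers; the point is only that swapping head and tail changes the RESCAL score but not the DistMult score.

```python
def rescal_score(h_i, M_k, h_j):
    """Bilinear score h_i^T M_k h_j with a full d x d relation matrix M_k."""
    d = len(h_i)
    return sum(h_i[a] * M_k[a][b] * h_j[b] for a in range(d) for b in range(d))


def distmult_score(h_i, m_k, h_j):
    """Same bilinear form, with M_k restricted to the diagonal matrix diag(m_k)."""
    return sum(a * m * b for a, m, b in zip(h_i, m_k, h_j))


# Toy 3-dimensional embeddings (hypothetical values, for illustration only).
h_drug = [0.5, -1.0, 2.0]
h_gene = [1.0, 0.5, -0.5]
M_binds = [[1.0, 0.2, 0.0],
           [0.0, 1.0, 0.3],
           [0.5, 0.0, 1.0]]
m_binds = [1.0, 1.0, 1.0]

forward = rescal_score(h_drug, M_binds, h_gene)       # (drug, binds, gene)
backward = rescal_score(h_gene, M_binds, h_drug)      # (gene, binds, drug)
sym_forward = distmult_score(h_drug, m_binds, h_gene)
sym_backward = distmult_score(h_gene, m_binds, h_drug)
```

With a non-symmetric `M_binds`, `forward` and `backward` differ, whereas the two DistMult scores coincide for any choice of `m_binds`.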

Translational distance models are based on the idea that, for each triplet
$(e_i, r_k, e_j)$, the relation $r_k$ can be considered as a translation from the head entity
$e_i$ to the tail entity $e_j$ in the embedding space. Accordingly, they exploit distance-based
energy functions to model the triplets in a KG. In this context, TransE (Bordes et al, 2013)
is the pioneer of the translational distance model family. It represents relation $r_k$ as
a translation vector $g_k$ such that the embeddings of $e_i$ and $e_j$ are closely connected
by $g_k$. Therefore, the energy function is defined as
$f(e_i, r_k, e_j) = \|h_i + g_k - h_j\|_2$. Since all parameters to learn are entity and
relation embedding vectors lying in the same low-dimensional space, TransE is easy to
train. A drawback of TransE is that it cannot handle 1-to-N, N-to-1, and N-to-N
structures in KGs well. To address this issue, TransH (Wang et al, 2014) extends TransE
by introducing a hyperplane for each relation $r_k$ and projecting $e_i$ and $e_j$ onto the
hyperplane before constructing the translation scheme. In this way, TransH improves
model capacity while preserving efficiency. Similarly, TransR (Lin et al, 2015) extends
TransE by introducing a relation-specific space. Further, for more fine-grained embedding,
TransD (Ji et al, 2015) extends TransE by constructing two matrices $M_k^1$ and $M_k^2$
for each $r_k$ to project $e_i$ and $e_j$, respectively. Hence it captures both entity
diversity and relation diversity. Further, TranSparse (Ji et al, 2016) simplifies TransR
by using adaptive sparse matrices to model different types of relations, and TransF
(Feng et al, 2016) relaxes the translation restriction as $h_i + g_k \approx \alpha h_j$.
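
As a minimal sketch of the TransE energy function, the snippet below scores a true triplet and a corrupted one. The margin-based ranking loss is the training objective commonly paired with TransE and is not described in the text above; the function names and toy vectors are illustrative assumptions.

```python
import math


def transe_energy(h_i, g_k, h_j):
    """L2 distance between the translated head h_i + g_k and the tail h_j."""
    return math.sqrt(sum((a + g - b) ** 2 for a, g, b in zip(h_i, g_k, h_j)))


def margin_ranking_loss(pos_energy, neg_energy, margin=1.0):
    """Hinge loss that pushes true triplets below corrupted ones by a margin."""
    return max(0.0, margin + pos_energy - neg_energy)


# Hypothetical 2-dimensional embeddings.
h_head = [1.0, 0.0]
g_rel = [0.0, 1.0]
h_tail = [1.0, 1.0]       # satisfies h_head + g_rel = h_tail exactly
h_corrupt = [4.0, 5.0]    # tail replaced by an unrelated entity

pos = transe_energy(h_head, g_rel, h_tail)
neg = transe_energy(h_head, g_rel, h_corrupt)
loss = margin_ranking_loss(pos, neg)
```

The sketch also hints at the 1-to-N limitation: two different tails of the same head and relation can both reach zero energy only if their embeddings coincide.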

Meta-path-based approaches. A potential issue for both semantic matching
models and translational distance models is that they mainly focus on one-hop
information (i.e., modeling neighboring entities within a triplet) and hence may
ignore the global structure properties of KGs. To address this, the meta-path-based
models aim at capturing local and global structure properties, as well as entity and
relation types, for KG inference. Typically, a meta-path is defined as a sequence of
node types separated by edge types (Sun et al, 2011). For example, a meta-path
of length $l$ is $a_1 \xrightarrow{b_1} a_2 \xrightarrow{b_2} \cdots \xrightarrow{b_{l-1}} a_l$,
where $\{a_1, a_2, \ldots, a_l\}$ and $\{b_1, b_2, \ldots, b_{l-1}\}$ are
the sets of node types and relation types, respectively. Following this idea,
Heterogeneous Information Network Embedding (HINE) (Huang and Mamoulis, 2017)
defines a meta-path-based proximity. It preserves the heterogeneous structure by
minimizing the difference between the meta-path-based proximity and the expected
proximity in the embedding space. Moreover, metapath2vec (Dong et al, 2017) formalizes
meta-path-based random walks and extends the word embedding model SkipGram
to learn entity embeddings, by considering each walk path as a sentence and entities
as words.
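
A meta-path-guided random walk of the kind metapath2vec relies on can be sketched as follows. This is a simplified illustration under stated assumptions: the meta-path is given as a list of node types only (edge types are ignored), its first and last types coincide so that it can be repeated, and the toy graph and all names are hypothetical.

```python
import random


def metapath_walk(neighbors, node_type, start, metapath, walk_length, rng):
    """Random walk from `start` whose node types follow a cyclic meta-path.

    `metapath` is a list of node types whose first and last entries coincide,
    e.g. ["Drug", "Gene", "Drug"], so that the pattern can be repeated.
    """
    walk = [start]
    pos = 0
    while len(walk) < walk_length:
        # Advance along the schema; after the last type, wrap to position 1.
        pos = pos + 1 if pos + 1 < len(metapath) else 1
        wanted = metapath[pos]
        candidates = [n for n in neighbors[walk[-1]] if node_type[n] == wanted]
        if not candidates:
            break  # dead end: no neighbor of the required type
        walk.append(rng.choice(candidates))
    return walk


# Hypothetical toy graph with three node types.
node_type = {"d1": "Drug", "d2": "Drug", "g1": "Gene", "g2": "Gene", "x": "Disease"}
neighbors = {
    "d1": ["g1", "g2"],
    "d2": ["g1"],
    "g1": ["d1", "d2", "x"],
    "g2": ["d1", "x"],
    "x": ["g1", "g2"],
}
walk = metapath_walk(neighbors, node_type, "d1", ["Drug", "Gene", "Drug"],
                     7, random.Random(0))
```

Each resulting walk can then be treated as a sentence for a SkipGram-style model; the disease node never appears because the chosen meta-path does not include its type.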

Convolutional neural network (CNN) models have also been used to address
the KG inference task. For example, ConvE (Dettmers et al, 2018) uses a CNN
architecture for link prediction in KGs. For each triplet $(e_i, r_k, e_j)$, ConvE first
reshapes the embedding vectors of $e_i$ and $r_k$ into two matrices and concatenates them.
The resulting matrix is then fed to the convolutional layers to produce feature maps,
which are then transformed into the entity embedding space to match the embedding of
$e_j$. In addition, ConvKB (Nguyen et al, 2017) directly concatenates the embedding
vectors of $e_i$, $r_k$, and $e_j$ for each triplet $(e_i, r_k, e_j)$ into a 3-column matrix.
The matrix is then fed to the convolutional layers to learn the entity and relation
embeddings.


24.3.2 GNN-based KG inference techniques 


This subsection discusses KG inference techniques based on the novel GNN archi- 
tectures. 
Graph convolution network (GCN)-based architectures. A pioneering effort in
using and extending GCN for KG inference is the Relational GCN (R-GCN) (Schlichtkrull
et al, 2018). In contrast to the original application scenario of GCN, the structure of
a KG is usually heterogeneous, with diverse entity types and relation types. To
address this, R-GCN introduces two subtle modifications on the regular GCN
architecture (Berg et al, 2017). Specifically, for each entity, instead of simply aggregating
information from all of its neighbors, R-GCN uses a relation-specific transformation
mechanism, which first gathers information from neighboring entities separately for
each relation type and relation direction and then accumulates them together.
Specifically,

$$h_i^{(l+1)} = \sigma\Big(\sum_{r_k \in \mathcal{R}} \sum_{j \in \mathcal{N}_i^k} \frac{1}{c_{i,k}} W_k^{(l)} h_j^{(l)} + W_0^{(l)} h_i^{(l)}\Big) \tag{24.1}$$

Here $h_i^{(l+1)}$ is the embedding vector of entity $e_i$ at the $(l+1)$-th graph convolutional
layer, $\mathcal{R}$ is the set of all relations, and $\mathcal{N}_i^k$ is the set of neighbors
of entity $e_i$ under relation $r_k$. $W_k^{(l)}$ is the weight matrix specific to relation $r_k$
and $W_0^{(l)}$ is the weight matrix of the self-connection. The problem-specific normalization
coefficient $c_{i,k}$ can be either learned or pre-defined. Using softmax for each entity,
R-GCN can be trained for entity classification. In link prediction, R-GCN is used as an
encoder for learning embedding vectors of the entities, while the factorization model
DistMult is used as the decoder to predict missing links in the KG based on the learned
entity embeddings. This resulted in significantly improved performance compared to
baseline models like DistMult and TransE.
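
The relation-specific aggregation of Eq. (24.1) can be sketched in plain Python. This is a simplified illustration, not the reference implementation: it assumes ReLU as the activation, the neighbor count as the pre-defined normalization $c_{i,k}$, and messages flowing only from the source to the target of each triplet (inverse relations would have to be added as extra relation types). All names and values are hypothetical.

```python
def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def rgcn_layer(h, triples, W_rel, W_self):
    """One R-GCN update.

    h:       dict entity -> embedding (list of floats)
    triples: list of (source, relation, target); messages flow source -> target
    W_rel:   dict relation -> weight matrix W_k
    W_self:  weight matrix W_0 of the self-connection
    """
    # Group the neighbors of each target entity by relation type.
    neigh = {}
    for j, k, i in triples:
        neigh.setdefault(i, {}).setdefault(k, []).append(j)

    out = {}
    for i, h_i in h.items():
        acc = matvec(W_self, h_i)
        for k, js in neigh.get(i, {}).items():
            c_ik = len(js)  # assumed normalization: number of neighbors under r_k
            for j in js:
                msg = matvec(W_rel[k], h[j])
                acc = [a + m / c_ik for a, m in zip(acc, msg)]
        out[i] = [max(0.0, a) for a in acc]  # ReLU as the activation
    return out


h = {"drug_a": [1.0, 0.0], "drug_b": [3.0, 0.0],
     "gene": [0.0, 1.0], "disease": [1.0, 1.0]}
triples = [("drug_a", "binds", "gene"),
           ("drug_b", "binds", "gene"),
           ("disease", "associates", "gene")]
W_rel = {"binds": [[2.0, 0.0], [0.0, 2.0]],
         "associates": [[0.0, 1.0], [1.0, 0.0]]}
W_self = [[1.0, 0.0], [0.0, 1.0]]

h_next = rgcn_layer(h, triples, W_rel, W_self)
```

In this toy example the gene entity averages the two drug messages under the binds relation, adds the single disease message under the associates relation, and finally adds its own transformed embedding.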

Cai et al (2019) proposed TransGCN, which combines the GCN architecture
with the translational distance models (e.g., TransE and RotatE) for link prediction
in KGs. Compared to R-GCN, TransGCN aims to address the link prediction task
without a task-specific decoder and to learn both entity embeddings and
relation embeddings simultaneously. For each triplet $(e_i, r_k, e_j)$, TransGCN assumes
that $r_k$ is the transformation from the head $e_i$ to the tail $e_j$ in the embedding space.
Then it extends the GCN layer to update the embedding of $e_i$ as

$$m_i^{(l+1)} = \frac{1}{c_i} W_0^{(l)} \Big(\sum_{(e_j, r_k, e_i) \in \mathcal{N}_i^{(in)}} h_j^{(l)} \circ g_k^{(l)} + \sum_{(e_i, r_k, e_j) \in \mathcal{N}_i^{(out)}} h_j^{(l)} \star g_k^{(l)}\Big) \tag{24.2}$$

$$h_i^{(l+1)} = \sigma\big(m_i^{(l+1)} + h_i^{(l)}\big) \tag{24.3}$$

where $\circ$ and $\star$ are transformation operators that can be defined based on the specific
translational mechanism used. $\mathcal{N}_i^{(in)}$ and $\mathcal{N}_i^{(out)}$ are the sets of
incoming and outgoing triplets of $e_i$, respectively. The normalization constant $c_i$ is
defined by the total degree of entity $e_i$. Meanwhile, the embedding of each relation $r_k$
is updated simply by $g_k^{(l+1)} = \sigma(W_1^{(l)} g_k^{(l)})$. The authors engaged two
translational mechanisms, TransE and RotatE, and defined $\circ$, $\star$, and the scoring
functions accordingly. Both resulting architectures, TransE-GCN and RotatE-GCN, showed
higher performance than TransE, RotatE, and R-GCN in the experiments.
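
To illustrate Eqs. (24.2) and (24.3), the sketch below instantiates the two operators with the TransE assumption $h_i + g_k \approx h_j$: an incoming triplet $(e_j, r_k, e_i)$ contributes the estimate $h_j + g_k$, and an outgoing triplet $(e_i, r_k, e_j)$ contributes $h_j - g_k$. This operator choice is an assumption derived from the translation principle rather than a transcription of the original paper, and the function names and toy values are hypothetical.

```python
def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def transgcn_layer(h, g, triples, W0, act=lambda v: v):
    """One entity update in the spirit of Eqs. (24.2)-(24.3), TransE-style operators.

    h: dict entity -> embedding, g: dict relation -> embedding,
    triples: list of (head, relation, tail), W0: weight matrix,
    act: element-wise activation (identity by default, an assumption).
    """
    out = {}
    for i, h_i in h.items():
        total = [0.0] * len(h_i)
        degree = 0
        for head, k, tail in triples:
            if tail == i:      # incoming triplet (e_j, r_k, e_i): h_j + g_k
                est = [a + b for a, b in zip(h[head], g[k])]
            elif head == i:    # outgoing triplet (e_i, r_k, e_j): h_j - g_k
                est = [a - b for a, b in zip(h[tail], g[k])]
            else:
                continue
            total = [t + e for t, e in zip(total, est)]
            degree += 1
        if degree:
            m_i = [x / degree for x in matvec(W0, total)]  # c_i = total degree
        else:
            m_i = [0.0] * len(h_i)
        out[i] = [act(m + x) for m, x in zip(m_i, h_i)]
    return out


h = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [2.0, 2.0]}
g = {"r": [1.0, 1.0]}
triples = [("b", "r", "a"), ("a", "r", "c")]
W0 = [[1.0, 0.0], [0.0, 1.0]]

h_next = transgcn_layer(h, g, triples, W0)
```

Entity `a` receives one estimate from its incoming triplet and one from its outgoing triplet; their average is added to its previous embedding, mirroring the residual form of Eq. (24.3).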

Structure-Aware Convolutional Network (SACN) (Shang et al, 2019) is another
architecture for knowledge graph inference based on GCN. Similar to R-GCN, it
engages a weighted graph convolutional network (WGCN) as the encoder to capture
the structure properties of the KG. WGCN considers a KG with multiple relation
types as a combination of multiple sub-graphs, each with a single relation type. Then, the
embedding vector of each entity $e_i$ can be obtained by a weighted combination of
the information propagated on each sub-graph,

$$h_i^{(l+1)} = \sigma\Big(\sum_{j \in \mathcal{N}_i} \alpha_k^{(l)} h_j^{(l)} W^{(l)} + h_i^{(l)} W^{(l)}\Big) \tag{24.4}$$

where $\alpha_k^{(l)}$ is the weight of relation $r_k$ at the $l$-th layer. The learned embedding from
WGCN is then fed to a decoder, Conv-TransE, a CNN with TransE's translational
mechanism, for link prediction.

Graph attention network (GAT)-based architectures. A potential drawback
of the GCN architectures is that, for each entity, they treat all neighbors equally
when gathering information. However, different neighboring entities, relations, or triplets
may have different importance in indicating a specific entity, and the weights of
neighboring entities under the same relation may also be distinct. To address this,
GATs have been applied to KG inference problems. One of the early
efforts is GATE-KG (i.e., graph attention-based embedding in KG) (Nathani
et al, 2019). It introduces an extended and generalized attention mechanism as the
encoder to produce the entity and relation embeddings while capturing the diverse
relation types in the KG. For each triplet $(e_i, r_k, e_j)$, GATE-KG first produces a
representation vector $c_{ijk}^{(l)}$ of this triplet by

$$c_{ijk}^{(l)} = W_1^{(l)} [h_i^{(l)} \| h_j^{(l)} \| g_k^{(l)}] \tag{24.5}$$

Here $\|$ is the concatenation operation. The attention coefficient $\alpha_{ijk}^{(l)}$ is obtained by

$$\beta_{ijk}^{(l)} = \mathrm{LeakyReLU}\big(W_2^{(l)} c_{ijk}^{(l)}\big) \tag{24.6}$$

$$\alpha_{ijk}^{(l)} = \frac{\exp(\beta_{ijk}^{(l)})}{\sum_{j' \in \mathcal{N}_i} \sum_{k' \in \mathcal{R}_{ij'}} \exp(\beta_{ij'k'}^{(l)})} \tag{24.7}$$

where $\mathcal{R}_{ij}$ is the set of all relations between $e_i$ and $e_j$. By aggregating information
from neighbors according to different relations, the embedding vector $h_i^{(l+1)}$ of entity $e_i$
at the $(l+1)$-th layer can be calculated as

$$h_i^{(l+1)} = \sigma\Big(\sum_{j \in \mathcal{N}_i} \sum_{k \in \mathcal{R}_{ij}} \alpha_{ijk}^{(l)} c_{ijk}^{(l)}\Big) \tag{24.8}$$


In addition, by using the auxiliary relation between n-hop neighbors and itera- 
tively accumulating information of n-hop neighbors at the n-th graph attention layer, 
GATE-KG gives high weights to the 1-hop neighbors while lower weights to the n- 
hop neighbors. Hence it captures the multi-hop structure information of KG. 

Relational Graph neural network with Hierarchical ATtention (RGHAT) (Zhang
et al, 20201) is another GAT-based model to address link prediction in KGs.
Specifically, it engages a two-level attention mechanism. First, a relation-level attention
defines the weight of each relation $r_k$ in indicating a specific entity $e_i$ as

$$a_{ik} = W_1 [h_i \| g_k] \tag{24.9}$$

$$\alpha_{ik} = \frac{\exp(\sigma(z_1 \cdot a_{ik}))}{\sum_{r_x \in \mathcal{N}_i} \exp(\sigma(z_1 \cdot a_{ix}))} \tag{24.10}$$

where $z_1$ is a learnable parameter vector, $\sigma$ is LeakyReLU, and $\mathcal{N}_i$ is the set of
neighboring relations of entity $e_i$. Second, it defines an entity-level attention as

$$b_{ikj} = W_2 [a_{ik} \| h_j] \tag{24.11}$$

$$\beta_{kj} = \frac{\exp(\sigma(z_2 \cdot b_{ikj}))}{\sum_{e_{j'} \in \mathcal{N}_{i,k}} \exp(\sigma(z_2 \cdot b_{ikj'}))} \tag{24.12}$$

where $z_2$ is a learnable parameter vector and $\mathcal{N}_{i,k}$ denotes the set of tail entities of
entity $e_i$ under relation $r_k$. The final attention coefficient for gathering information via
triplet $(e_i, r_k, e_j)$ is calculated as $\mu_{ikj} = \alpha_{ik} \cdot \beta_{kj}$. Similar to GATE-KG, RGHAT
engages ConvE as the decoder for link prediction.
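
The way the two attention levels combine can be sketched as two nested softmax operations. In this minimal illustration the raw scores $\sigma(z_1 \cdot a_{ik})$ and $\sigma(z_2 \cdot b_{ikj})$ are assumed to be precomputed and are passed in as plain numbers; the function names and the toy neighborhood are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a dict of raw scores."""
    top = max(scores.values())
    exps = {key: math.exp(s - top) for key, s in scores.items()}
    total = sum(exps.values())
    return {key: e / total for key, e in exps.items()}


def hierarchical_attention(rel_scores, ent_scores):
    """Combine relation-level and entity-level attention for one entity e_i.

    rel_scores: {relation: raw score}, normalized over neighboring relations
    ent_scores: {relation: {tail: raw score}}, normalized per relation
    Returns {(relation, tail): mu}, the final triplet-level coefficients.
    """
    alpha = softmax(rel_scores)
    mu = {}
    for k, tails in ent_scores.items():
        beta = softmax(tails)
        for j, b in beta.items():
            mu[(k, j)] = alpha[k] * b
    return mu


# Hypothetical neighborhood of one drug entity.
rel_scores = {"binds": 0.0, "treats": 0.0}
ent_scores = {"binds": {"g1": 0.0, "g2": 0.0}, "treats": {"x": 1.3}}
mu = hierarchical_attention(rel_scores, ent_scores)
```

Because each level is normalized separately, the final coefficients over all triplets of the entity still sum to one, while tails under a relation with many neighbors share that relation's weight.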

Wang et al (2019j) proposed the Knowledge Graph Attention Network (KGAT)
for recommendation based on KG, which contains three types of layers. First, an
embedding layer learns embeddings for entities and relations using TransR. Second,
the attentive embedding propagation layers extend GAT to capture the high-order
structure properties (i.e., multi-hop neighbor information) of the KG. Specifically, they
define the attention coefficient for each triplet $(e_i, r_k, e_j)$ depending on the distance
between $e_i$ and $e_j$ in the space of $r_k$, i.e.,

$$\beta_{ijk} = (W_k h_i)^\top \tanh(W_k h_j + g_k) \tag{24.13}$$

$$\alpha_{ijk} = \frac{\exp(\beta_{ijk})}{\sum_{j' \in \mathcal{N}_i} \sum_{k' \in \mathcal{R}_{ij'}} \exp(\beta_{ij'k'})} \tag{24.14}$$

KGAT then stacks multiple attentive embedding propagation layers to capture
information from the multi-hop neighbors of each entity. Specifically, the embedding of
entity $e_i$ at the $(l+1)$-th layer is
$h_i^{(l+1)} = \sigma\big(W^{(l)}(h_i^{(l)} + h_{\mathcal{N}_i}^{(l)})\big)$, where
$h_{\mathcal{N}_i}^{(l)} = \sum_{(e_i, r_k, e_j) \in \mathcal{N}_i} \alpha_{ijk} h_j^{(l)}$.

Finally, a prediction layer concatenates the embeddings of each entity from all graph
attention layers to make predictions.
In addition, Heterogeneous graph Attention Network (HAN) (Wang et al, 2019m)
uses GAT to address node (i.e., entity) classification in heterogeneous graphs
(a KG can be considered as a specific type of heterogeneous graph). HAN couples
the graph attention mechanism with meta-paths to capture the heterogeneous structure
properties. A hierarchical attention mechanism that contains a node-level attention
and a semantic-level attention was proposed. The node-level attention aims to learn
the importance of the meta-path-based neighbors in indicating a node. Specifically,
it first projects different types of entities into the same space by $h'_i = M_{\phi_i} h_i$,
where $\phi_i$ is the type of entity $e_i$, and $h'_i$ and $h_i$ are the projected and original
embeddings of $e_i$, respectively. It then calculates the attention weight $\alpha_{ij}^{\Phi}$ of
entity pair $(e_i, e_j)$ under a specific meta-path $\Phi$ as

$$\alpha_{ij}^{\Phi} = \frac{\exp(a_{\Phi}^\top \cdot [h'_i \| h'_j])}{\sum_{k \in \mathcal{N}_i^{\Phi}} \exp(a_{\Phi}^\top \cdot [h'_i \| h'_k])} \tag{24.15}$$

where $\mathcal{N}_i^{\Phi}$ is the set of neighbors of $e_i$ under meta-path $\Phi$ and $a_{\Phi}$ is the
node-level attention vector. In addition, the semantic attention layer learns the importance
of each meta-path $\Phi$ in the task (i.e., classification) by

$$w_{\Phi} = \frac{1}{|\mathcal{V}|} \sum_{e_i \in \mathcal{V}} q^\top \cdot \tanh(W \cdot z_i^{\Phi} + b) \tag{24.16}$$

where $\mathcal{V}$ is the set of all entities, $z_i^{\Phi}$ is the node-level embedding of $e_i$ under
meta-path $\Phi$, $q$ is the learnable semantic-level attention vector, and $b$ is
the bias. Then the semantic-level attention weight is calculated as
$\beta_{\Phi} = \exp(w_{\Phi}) / \sum_{\Phi'} \exp(w_{\Phi'})$.

The final embeddings of all entities, $Z = \sum_{\Phi} \beta_{\Phi} Z_{\Phi}$, are used for classification.


24.4 KG-based hypothesis generation in computational drug 
development 


Generally, the drug repurposing procedure includes three major steps: hypothesis
generation, assessment, and validation (Pushpakom et al, 2019). Among them, the
first and foremost step is hypothesis generation. Typically, hypothesis generation
for drug repurposing aims at identifying candidate drugs that have a high confidence
of being associated with the therapeutic indication of interest. Today's widely available
BKGs, encoding a huge volume of biomedical knowledge, have become a valuable
resource for drug repurposing. In a KG, the hypothesis generation procedure can be
formulated as a link prediction problem, i.e., computational identification of potential
drug-target or drug-disease associations with a high confidence level based on
existing knowledge (the KG's structure properties). This section introduces some
preliminary efforts of hypothesis generation for drug repurposing using computational
approaches in BKGs.
24.4.1 A machine learning framework for KG-based drug 
repurposing 


One of the previous efforts using computational inference in BKGs for drug
repurposing is Zhu et al.'s study (Zhu et al, 2020b). The main contributions of this study
are two-fold: 1) KG construction via data integration, and 2) building the KG-based
machine learning pipeline for drug repurposing.

First, by integrating six drug knowledge bases, including PharmGKB
(Whirl-Carrillo et al, 2012), TTD (Yang et al, 2016a), KEGG DRUG (Kanehisa et al,
2007), DrugBank (Wishart et al, 2018), SIDER (Kuhn et al, 2016), and DID (Sharp,
2017), they curated a drug-centric KG consisting of five entity types (drugs,
diseases, genes, pathways, and side-effects) and nine relation types: drug-disease
TREATS; drug-drug INTERACTS; drug-gene REGULATES, BINDS, and ASSOCIATES;
drug-side effect CAUSES; gene-gene ASSOCIATES; gene-disease ASSOCIATES; and
gene-pathway PARTICIPATES.

Second, based on the drug-centric KG, a machine learning pipeline was built
for drug repurposing. Specifically, the target of the proposed model was to predict
the existence of a relation between a pair of drug and disease entities. In this
way, the task fell into the supervised classification setting where the input samples
were the drug-disease pairs. To this end, the representation of each sample
(drug-disease pair) was calculated in two ways: 1) meta-path-based representation and 2)
KG embedding-based representation. For the meta-path-based representation, 99 possible
meta-paths between drugs and diseases with length 2-4 were enumerated, such as
$\text{Drug} \xrightarrow{\text{TREATS}} \text{Gene} \xrightarrow{\text{ASSOCIATES}} \text{Disease}$ and
$\text{Drug} \xrightarrow{\text{TREATS}} \text{Gene} \xrightarrow{\text{ASSOCIATES}} \text{Gene} \xrightarrow{\text{ASSOCIATES}} \text{Disease}$.
Then a 99-dimensional representation vector was calculated for each drug-disease pair,
of which each element indicates the connectivity measure between the two entities
based on a specific meta-path. In this study, four different connectivity measures
were used under a specific meta-path $\Phi$, including


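• Path count $PC_{\Phi}(e_{dr}, e_{di})$, the number of paths between drug $e_{dr}$ and disease $e_{di}$;
• Head normalized path count $HNPC_{\Phi} = \frac{PC_{\Phi}(e_{dr}, e_{di})}{PC_{\Phi}(e_{dr}, *)}$;
• Tail normalized path count $TNPC_{\Phi} = \frac{PC_{\Phi}(e_{dr}, e_{di})}{PC_{\Phi}(*, e_{di})}$;
• Normalized path count $NPC_{\Phi} = \frac{PC_{\Phi}(e_{dr}, e_{di})}{PC_{\Phi}(e_{dr}, *) + PC_{\Phi}(*, e_{di})}$.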
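
The four connectivity measures can be computed on a toy KG as follows. This is a minimal sketch under stated assumptions: `*` is read as "any entity at that end of the meta-path", the meta-path is represented as a list of typed relation steps, and the graph, the relation names chosen for the steps, and the function names are all hypothetical.

```python
def path_counts_from(edges, metapath, source):
    """Count meta-path instances from `source` to every reachable end entity.

    edges:    dict relation-step -> list of (src, dst) pairs
    metapath: list of relation-steps to follow in order
    """
    frontier = {source: 1}
    for step in metapath:
        nxt = {}
        for src, dst in edges[step]:
            if src in frontier:
                nxt[dst] = nxt.get(dst, 0) + frontier[src]
        frontier = nxt
    return frontier


def connectivity(counts, drug, disease):
    """PC, HNPC, TNPC and NPC for one drug-disease pair under one meta-path."""
    pc = counts[drug].get(disease, 0)
    head_total = sum(counts[drug].values())                       # PC(e_dr, *)
    tail_total = sum(c.get(disease, 0) for c in counts.values())  # PC(*, e_di)
    both = head_total + tail_total
    return {
        "PC": pc,
        "HNPC": pc / head_total if head_total else 0.0,
        "TNPC": pc / tail_total if tail_total else 0.0,
        "NPC": pc / both if both else 0.0,
    }


# Hypothetical toy KG and a length-2 meta-path Drug -> Gene -> Disease.
step1 = ("Drug", "BINDS", "Gene")
step2 = ("Gene", "ASSOCIATES", "Disease")
edges = {
    step1: [("d1", "g1"), ("d1", "g2"), ("d2", "g1")],
    step2: [("g1", "x"), ("g2", "x"), ("g2", "y")],
}
counts = {d: path_counts_from(edges, [step1, step2], d) for d in ["d1", "d2"]}
measures = connectivity(counts, "d1", "x")
```

Repeating this for every enumerated meta-path and stacking the chosen measure would give one feature vector per drug-disease pair, in the spirit of the 99-dimensional representation described above.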


For the KG embedding-based representation, three translational distance models,
including TransE (Bordes et al, 2013), TransH (Wang et al, 2014), and TransR (Lin
et al, 2015), were used. Specifically, for each pair of drug $e_{dr}$ and disease $e_{di}$, using
each of the three models, their embedding vectors $h_{dr}$ and $h_{di}$ were first learned.
Then the representation of the drug-disease pair $(e_{dr}, e_{di})$ was calculated as the
difference of the two embedding vectors, $h_{dr} - h_{di}$.
After that, a machine learning pipeline was built whose inputs are the
representations of the drug-disease pairs. A drug-disease pair was labeled as positive if
there is a relation between them. However, a drug-disease pair without a relation
between them is not necessarily negative; instead, it was marked as unknown/unlabeled.
To address this, a positive and unlabeled (PU) learning framework (Elkan and Noto,
2008) was used. Decision Tree, Random Forest, and support vector machine (SVM)
were each used as the basic classifier of this PU learning framework. In this
study, drug-disease relations related to eight diseases were used as the testing set,
while the remaining drug-disease relations (positive) and 143,830 pairs associating
the eight diseases with other drugs (unlabeled) were used as the training set.
Experimental results showed that the KG-driven pipeline can achieve high prediction
performance on known diabetes mellitus treatments using only the treatment information
of other diseases.
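
The calibration idea behind PU learning in the sense of Elkan and Noto (2008) can be sketched briefly: a classifier trained to separate labeled from unlabeled pairs estimates the probability of being labeled, which is then divided by a constant $c$ estimated on held-out positives to approximate the probability of being truly positive. The text above does not specify which variant the study used, so this is only an illustration; the scores and function names are hypothetical.

```python
def estimate_label_frequency(heldout_positive_scores):
    """Estimate c = p(labeled | positive) as the mean score on held-out positives."""
    return sum(heldout_positive_scores) / len(heldout_positive_scores)


def adjust_scores(scores, c):
    """Convert p(labeled | x) into an estimate of p(positive | x), capped at 1."""
    return [min(1.0, s / c) for s in scores]


# Hypothetical classifier outputs p(labeled | x).
heldout_positive_scores = [0.5, 0.25, 0.75]   # known drug-disease pairs held out
unlabeled_scores = [0.1, 0.4, 0.6]            # pairs with no known relation

c = estimate_label_frequency(heldout_positive_scores)
adjusted = adjust_scores(unlabeled_scores, c)
```

Any base classifier that outputs probabilities, such as the Decision Tree, Random Forest, or SVM models mentioned above, could supply the raw scores in this scheme.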


24.4.2 Application of KG-based drug repurposing in COVID-19 


The sudden outbreak of the human coronavirus disease 2019 (COVID-19) has led
to a pandemic that heavily strikes the healthcare system and tremendously impacts
people's lives around the world. To date, many drugs have been under investigation to
treat COVID-19 at tremendous cost; however, very few COVID-19
antiviral medications have been approved. In this context, there is an urgent need for a
more efficient and effective way of drug development against the pandemic, and
computational drug repurposing can be a promising approach to address this.

Zeng et al.'s work (Zeng et al, 2020b) is a pioneering effort that computationally
repurposes antiviral medications for COVID-19 based on KG inference. First,
a comprehensive biomedical KG was constructed by integrating two biomedical
relational data resources, the Global Network of Biomedical Relationships (GNBR)
(Percha and Altman, 2018) and DrugBank (Wishart et al, 2018), with experimentally
discovered COVID-gene relationships (Zhou et al, 2020f), resulting in a KG
consisting of 145,179 entities of four types (drugs, disease, genes, and drug side
information) and 15,018,067 relationships of 39 types. Second, a deep KG embedding
model, RotatE, was applied to learn low-dimensional representations
for the entities and relations. Using the learned embedding vectors, the top 100
drugs that are closest to the COVID-19 entity in the embedding space were
prioritized as the candidate drugs. Using drugs in ongoing COVID-19 clinical trials
(https://covid19-trials.com/) as a validation set, the results achieved a
desirable performance with an area under the receiver operating characteristic curve
(AUROC) of 0.85. Moreover, gene set enrichment analysis (GSEA), which involved
transcriptome data from peripheral blood and Calu-3 cells and proteome data from
Caco-2 cells, was performed to validate the candidate drugs. Finally, 41 drugs were
identified as potential repurposable candidates for COVID-19 therapy, 9 of which
were under ongoing COVID-19 trials. Among the 41 candidates, three types of
drugs were highlighted by the authors: 1) the Anti-Inflammatory Agents such as
dexamethasone, indomethacin, and melatonin; 2) the Selective Estrogen Receptor
Modulators (SERMs) such as clomifene, bazedoxifene, and toremifene; and 3) the
Antiparasitics including hydroxychloroquine and chloroquine phosphate.
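
The prioritization step can be illustrated with a small RotatE-style ranking. RotatE models a relation as an element-wise rotation in complex space, so a triplet is plausible when the rotated head lands close to the tail. The sketch below assumes a hypothetical "treats" relation and toy 2-dimensional complex embeddings; the text does not state exactly how closeness to the COVID-19 entity was measured in the study, so this is an assumption for illustration only.

```python
import cmath
import math


def rotate_distance(h, phases, t):
    """Sum over dimensions of |h_d * exp(i * phase_d) - t_d|."""
    return sum(abs(a * cmath.exp(1j * p) - b) for a, p, b in zip(h, phases, t))


def rank_candidates(candidates, phases, target):
    """Sort candidate names by increasing RotatE distance to the target."""
    return sorted(candidates,
                  key=lambda name: rotate_distance(candidates[name], phases, target))


# Hypothetical embeddings: the relation is a vector of rotation angles.
treats_phases = [math.pi / 2, math.pi]
h_covid = [1j, -1j]
drugs = {
    "drug_a": [1 + 0j, 1j],      # rotates exactly onto h_covid
    "drug_b": [1 + 0j, 1 + 0j],
    "drug_c": [0j, 0j],
}

ranking = rank_candidates(drugs, treats_phases, h_covid)
dist_a = rotate_distance(drugs["drug_a"], treats_phases, h_covid)
dist_b = rotate_distance(drugs["drug_b"], treats_phases, h_covid)
```

Taking the first entries of such a ranking over all drug entities gives a candidate list analogous to the top 100 drugs described above.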
Another work (Hsieh et al, 2020) focused on using GNNs in KGs to
address the drug repurposing problem. By extracting and integrating drug-target
interactions, pathways, and gene/drug-phenotype interactions from CTD (Davis et al,
2019), a SARS-CoV-2 KG was built, which consists of 27 SARS-CoV-2 baits, 5,677
host genes, 3,635 drugs, and 1,285 phenotypes, as well as 330 virus-host
protein-protein interactions, 13,423 gene-gene sharing pathway interactions, 16,972
drug-target interactions, 1,401 gene-phenotype associations, and 935 drug-phenotype
associations. Next, a variational graph autoencoder (Kipf and Welling, 2016), which
engages R-GCN (Schlichtkrull et al, 2018) as the encoder, was used to learn entity
embeddings in the SARS-CoV-2 KG. Since the SARS-CoV-2 KG has a specific focus
on COVID-19 related knowledge, some general yet meaningful biomedical knowledge
may be missing. To address this, a transfer learning framework was introduced.
Specifically, it first used the entity embeddings of Zeng et al.'s work (Zeng et al, 2020b),
which encode general biomedical knowledge, to initialize the entity embeddings in the
SARS-CoV-2 KG. Then the embeddings were fine-tuned in the SARS-CoV-2 KG through the
proposed GNN. Using a customized neural network ranking model, the 300 drugs that
are most relevant to COVID-19 were selected as the candidate drugs. Similar to
Zeng et al.'s work (Zeng et al, 2020b), the authors engaged GSEA, retrospective in vitro
drug screening, and population-based treatment effect analysis in electronic
health records (EHRs) to further validate the repurposable candidates. Through
such a pipeline, 22 drugs were highlighted for potential COVID-19 treatment,
including Azithromycin, Atorvastatin, Aspirin, Acetaminophen, and Albuterol.

In summary, these studies shed light on the importance of KG-based computational
approaches in drug repurposing to fight against complex diseases like
COVID-19. The reported good performance, in terms of the high overlap
between the repurposed candidate drug set and the drugs under ongoing COVID-19
trials, not only demonstrated the effectiveness of the KG-based techniques but also
provided biological evidence for the ongoing clinical trials. Moreover, they proposed
feasible ways of using other publicly available data to validate or refine the hypotheses
derived from KGs, which enhances the usability of KG-based approaches.


24.5 Future directions 


KGs have been playing an increasingly important role in biomedicine. An increasing
number of KG-based machine learning and deep learning approaches have been
used in biomedical studies such as hypothesis generation in computational drug
development. As one of the latest advances in artificial intelligence (AI), GNNs, which
have led to tremendous progress in image and text data mining (Kipf and Welling,
2017b; Hamilton et al, 2017b; Veličković et al, 2018), have been introduced to
address the KG inference problems. In this context, the use of GNNs in biomedical KGs
has great potential for improving hypothesis generation in computational drug
development. However, there remain significant gaps between the novel technique and
the success of computational drug development. This section discusses the potential
opportunities and future research possibilities in this field toward improvements of
hypothesis generation for computational drug development.


24.5.1 KG quality control 


The procedures of constructing and curating a biomedical KG typically include
manually gathering, annotating, and extracting knowledge from text (e.g., literature
or experimental reports), automatically or manually normalizing terminology
to integrate multiple data resources, and automatic text mining for knowledge
extraction. However, none of them is perfect. Therefore, the quality issue has
been challenging the KG inference approaches. In KG-based hypothesis generation
for drug repurposing, a poor-quality KG will lead to uninformative or wrong
representations and hence result in incorrect hypotheses (drug-disease associations)
and even failure of the entire drug repurposing project. Therefore, there
is an urgent need for accurate and appropriate KG quality control. In general, there
are two categories of quality issues in KGs: incorrectness and incompleteness.

Incorrectness refers to incorrect triplets in the KG, i.e., a triplet exists in the KG but
the corresponding relationship between the two entities is inconsistent with real-world
evidence. To address this, a common strategy is manual annotation of small sampled
subsets. Such a procedure is time-consuming and costly if one wants to
evaluate enough triplets to meet the statistical criteria. To address this, for example,
Gao et al (2019a) proposed an iterative evaluation framework for KG accuracy
evaluation. Specifically, inspired by the properties of the annotation cost function
observed in practice, the authors developed a cluster sampling strategy based on
unequal probability theory. Their framework reduced the annotation cost by 60%
and can be easily extended to address evolving KGs. In addition, the use of
well-designed biomedical vocabularies such as the Unified Medical Language System
(UMLS) (Bodenreider, 2004) will improve entity term normalization and hence
reduce the risk of errors caused by ambiguous biomedical entities. Moreover,
learning based on the KG structure to refine the KG is also a potential way to solve this
issue. Early efforts such as Zhao et al (2020d) have focused on this direction.

Incompleteness mainly refers to the absence of biologically or clinically meaningful
triplets in the KG. To address the incompleteness of biomedical KGs, a common
way is to integrate multiple data resources, biomedical databases, and biomedical
KGs to construct and curate a more comprehensive one. CKG (Santos et al,
2020), Hetionet (Himmelstein et al, 2017), DRKG (Ioannidis et al, 2020), and the
drug-centric KG of Zhu et al (2020b) are good examples of this strategy. However,
there is no guarantee that they are comprehensive enough to cover all biomedical
knowledge. In addition, today's widely available biomedical literature and medical
data (e.g., EHRs) are a great treasure of biomedical knowledge. In this context,
previous studies have focused on deriving knowledge from biomedical literature
(Zhao et al, 2020e; Xu et al, 2013; Zhang et al, 2018h; Sahu and Anand, 2018) and EHR
data (Rotmensch et al, 2017; Chen et al, 2020e), and the derived knowledge could be a good
complement to the biomedical KGs. Moreover, computational methods such as
the KG embedding models (e.g., TransE and TransH) and the GNNs (e.g., R-GCN)
have been used in KG completion (Arora, 2020), which predicts missing relations
within a KG according to its structure properties.


24.5.2 Scalable inference 


An ultimate goal of biomedical KGs is to comprehensively incorporate
biomedical knowledge. For example, by integrating 26 publicly available biomedical
databases, CKG (Santos et al, 2020) has included over 16 million biomedical
entities connected by over 220 million relationships; another KG, DRKG (Ioannidis
et al, 2020), integrating six databases and data collected from recent COVID-19
publications, has included about 97K entities and 5.8 million relationships. Meanwhile,
today's advanced high-throughput techniques as well as computer software and hardware
have led to a continuously increasing amount of relational data
interlinking biomedical entities like drugs, genes, proteins, chemical compounds,
diseases, and medical concepts extracted from clinical data. This largely enables us
to extract new knowledge to enrich the biomedical KGs, and hence these KGs keep
expanding.

In this context, the huge and continuously increasing volume of KGs may
challenge computational models like GNNs. To this end, there is an urgent need
for scalable techniques to address the high memory and time cost in KGs. For
example, Deep Graph Library (DGL, https://www.dgl.ai) (Wang et al, 2019f)
is an open-source, free Python package designed by Amazon for facilitating the
implementation of GNN family models, running on top of several deep learning
frameworks including PyTorch (Paszke et al, 2019), TensorFlow (Abadi et al, 2016),
and MXNet (Chen et al, 2015). As of March 1, 2021, its latest release is version
0.6. By distilling the message passing procedure of GNNs into generalized sparse
tensor operations, DGL provides implementations of optimization techniques like
kernel fusion, multi-thread and multi-process acceleration, and automatic sparse
format tuning to speed up the training process and reduce memory load. In addition
to GNNs, the DGL team also released DGL-KE (https://github.com/awslabs/dgl-ke)
(Zheng et al, 2020c), an easy-to-use framework for implementing
KG representation models such as TransE, DistMult, and RotatE, which has been
used in existing KG-based drug-repurposing studies such as Zeng et al (2020b).


24.5.3 Coupling KGs with other biomedical data 


Apart from KGs, there is an enormous volume of other biomedical data available, 
such as clinical data and omics data, which are also promising resources for 
computational drug repurposing. Clinical data are an important resource for healthcare 
and medical research, mainly including EHR data, claims data, and clinical trial data. 
EHR data are routinely collected during daily patient care and contain heterogeneous 
information about patients, such as demographics, diagnoses, laboratory 
test results, medications, and clinical notes. Such rich information makes it possible 
to track patients’ health condition changes, medication prescriptions, and clinical 
outcomes. In addition, a tremendous volume of EHR data has been collected and 
the volume is rapidly increasing, which strengthens the statistical power of 
EHR-based analysis. For this reason, beyond common uses such as diagnostic 
and prognostic prediction (Xiao et al, 2018; Si et al, 2020; Su et al, 2020e,a) and 
phenotyping (Chiu and Hripcsak, 2017; Weng et al, 2020; Su et al, 2020d, 2021), 
EHR data have been used for computational drug repurposing (Hurle et al, 2013; 
Pushpakom et al, 2019). For example, Wu et al (2019d) identified some non-cancer 
drugs as repurposable candidates to treat cancer using EHR data; Gurwitz 
(2020) analyzed EHR data to repurpose drugs for treating COVID-19. 


Figure 24.1: Coupling biomedical KGs with other biomedical data resources for 
improving computational drug development. 

Driven by high-throughput sequencing techniques, an enormous volume 
of omics data, including genomics, proteomics, transcriptomics, epigenomics, and 
metabolomics data, has been collected and made publicly available for analysis. 
Integrating and analyzing omics data enables us to derive new biomedical insights and 
better understand human health and diseases at the molecular level (Subramanian et al, 
2020; Nicora et al, 2020; Su et al, 2020b). Owing to their richness, omics data 
have also been used in computational drug development (Pantziarka and Meheus, 
2018; Nicora et al, 2020; Issa et al, 2020). For example, by mining multiple omics 
data sets, Zhang et al (2016c) identified 18 proteins as potential anti-Alzheimer’s 
disease (AD) targets and prioritized 7 repurposable drugs inhibiting the targets. Mokou 
et al (2020) proposed a drug repurposing pipeline in bladder cancer based on 
patients’ omics (proteomics and transcriptomics) signature data. 

In this context, combining KGs, clinical data, and multi-omics data and jointly 
learning from them is a promising route to advance computational drug development 
(Fig. 24.1). The benefits of combining these data for inference run in both directions. 
First, computational models of clinical data and multi-omics data usually suffer from 
data quality issues, such as noise and limited cohort size (especially for 
rare-disease populations), and from limited model interpretability. Incorporating KGs 
has been shown to address these issues effectively and to accelerate clinical data 
and omics data analysis. For example, Nelson et al (2019) linked EHR data with a 
biomedical KG and learned a barcode vector for each specific cohort (e.g., the obese 
cohort), which encodes both KG structure and EHR information and indicates the 
importance of each biomedical entity (e.g., genes, symptoms, and medications) in 
characterizing the cohort. Such cohort-specific barcode vectors also proved effective 
in link prediction (e.g., disease-gene association prediction). Wang et al 
(2017c) bridged patient EHR data with the BKG and extended the KG embedding 
model for safe medicine recommendation, which comprehensively considered relevant 
knowledge such as drug-drug interactions. In addition, Santos et al (2020) 
developed an open platform that couples the CKG (i.e., Clinical Knowledge Graph) 
with typical proteomics workflows. In this way, CKG facilitates the analysis and 
interpretation of proteomics data. Second, incorporating clinical data and 
omics data can potentially improve KG inference. Current KG-based drug repurposing 
studies have involved clinical data and omics data (Zeng et al, 2020b; Hsieh 
et al, 2020), which were typically used in an independent validation procedure to 
validate or refine the newly generated hypotheses (i.e., novel disease-drug 
associations). Moreover, previous studies have shown that leveraging clinical data 
(Rotmensch et al, 2017; Chen et al, 2020e; Pan et al, 2020c) and omics data (Ramos et al, 
2019) can derive new knowledge. Therefore, we believe that incorporating clinical 
data and omics data in KG inference may substantially reduce the impact of KG quality 
issues, especially incompleteness. In summary, when designing next-generation 
GNN models for drug repurposing, a direction worth pursuing is a feasible and flexible 
architecture that can harness KGs, clinical data, and multi-omics data so that they 
iteratively improve one another. 


Editor’s Notes: Drug hypothesis generation aims to use biological and clinical 
knowledge to generate biomedical molecules. This knowledge is effectively 
stored in the form of a knowledge graph (KG). The construction of a KG 
is relevant to graph generation (Chapter 11) and some applications, such 
as text mining (Chapter 21). Based on the KG, the hypothesis generation process 
mainly involves graph representation learning (Chapter 2) and graph structure 
learning (Chapter 14). It can also be formulated as a link prediction 
(Chapter 10) problem that calculates the confidence level of candidate drugs. 
Future directions of drug development focus on scalability (Chapter 
6) and interpretability (Chapter 7). 

Chapter 25 
Graph Neural Networks in Predicting Protein 
Function and Interactions 


Anowarul Kabir and Amarda Shehu 


Abstract Graph Neural Networks (GNNs) are becoming increasingly popular and 
powerful tools in molecular modeling research due to their ability to operate over 
non-Euclidean data, such as graphs. Because they can both embed the inherent 
structure and preserve the semantic information in a graph, GNNs are advancing 
diverse molecular structure-function studies. In this chapter, we focus on GNN- 
aided studies that bring together one or more protein-centric sources of data with 
the goal of elucidating protein function. We provide a short survey on GNNs and 
their most successful, recent variants designed to tackle the related problems of pre- 
dicting the biological function and molecular interactions of protein molecules. We 
review the latest methodological advances, discoveries, as well as open challenges 
promising to spur further research. 


25.1 From Protein Interactions to Function: An Introduction 


Molecular biology is now reaping the benefits of big data, as rapidly advancing 
high-throughput, automated wet-laboratory protocols have resulted in a vast amount 
of biological sequence, expression, interactions, and structure data (Stark, 2006; 
Zoete et al, 2011; Finn et al, 2013; Sterling and Irwin, 2015; Dana et al, 2018; 
Doncheva et al, 2018). Since functional characterization has lagged behind, we now 
have millions of protein products in databases for which no functional information 
is readily available; that is, we do not know what many of the proteins in our cells 
do (Gligorijevic et al, 2020). 
Answering the question of what function a protein molecule performs is key not 
only to understanding our biology and protein-centric disorders, but also to advanc- 
ing protein-targeted therapies. Hence, this question remains the driver of much wet- 
and dry-laboratory research in molecular biology (Radivojac et al, 2013; Jiang et al, 
2016). Answering it can take many forms based on the detail sought or possible. The 
highest amount of detail provides an answer to the question by directly exposing the 
other molecules with which a target protein interacts in the cell, thus revealing what 
a protein does by elucidating the molecular partners to which it binds. 

In this brief survey, we focus on how graph neural networks (GNNs) are ad- 
vancing our ability to answer this question in silico. This chapter is organized as 
follows: First, a brief historical overview is provided, so that the reader understands 
the evolution of ideas and data that have made possible the application of machine 
learning to the problem of protein function prediction. Then, a brief overview of the 
(shallow) models prior to GNNs is provided. The rest of the survey is devoted to 
the GNN-based formulation of this question, a summary of state-of-the-art (SOTA) 
GNN-based methods, with a few selected methods highlighted where relevant, and 
an exposition of remaining challenges and potential ways forward via GNNs. 


25.1.1 Enter Stage Left: Protein-Protein Interaction Networks 


Historically, the earliest methods devised for protein function prediction related pro- 
tein sequence similarity to protein function similarity. This led to important discov- 
eries until remote homologs were identified, which are proteins with low sequence 
similarity but highly similar three-dimensional/tertiary structure and function. So 
methods evolved to utilize tertiary structure, but their applicability was limited, as 
determination of tertiary structure was and remains a laborious process. Other meth- 
ods utilized patterns in gene expression data to infer interacting proteins, based on 
the insight that proteins interacting with one another need foremost to be expressed 
in the cell at the same time. 

With the development of high-throughput technologies, such as two-hybrid analysis 
for the yeast protein interactome (Ito et al, 2001), tandem-affinity purification 
and mass spectrometry (TAP-MS) (Gavin et al, 2002) for characterizing 
multi-protein complexes and protein-protein associations (Huang et al, 2016a), 
high-throughput mass spectrometric protein complex identification (HMS-PCI) (Ho 
et al, 2002), and co-immunoprecipitation coupled to mass spectrometry (Foltman and 
Sanchez-Diaz, 2016), protein-protein interaction (PPI) data suddenly became 
available in large amounts. PPI networks, in which edges connect interacting protein 
nodes, were released for many species, such as human, yeast, and mouse. These 
networks, ranging from a few nodes to tens of thousands of nodes, gave a boost to 
machine learning methods and improved the performance of shallow models. Surveys 
such as Shehu et al (2016) provide a 
detailed history of the evolution of protein function prediction methods as different 
sources of wet-laboratory data became available to computational biologists. 
25.1.2 Problem Formulation(s), Assumptions, and Noise: A 
Historical Perspective 


A natural question arises. If we have access to PPI data, then what else remains 
to predict with regard to protein function? Despite significant progress, the reality 
remains that there are many unmapped PPIs; recovering them is formally known as the 
link prediction problem. For various reasons, PPI networks are incomplete. They may 
entirely miss a protein, or they may contain incomplete information on a 
protein. In particular, we now know that PPI data suffer from high type-I error, high 
type-II error, and low inclusion (Luo et al, 2015; Byron and Vestergaard, 2015). The 
total number of PPI links that have been experimentally determined is still moderate (Han 
et al, 2005). PPI data are also inherently noisy, as experimental methods often produce 
false-positive results (Hashemifar et al, 2018). Therefore, predicting protein 
function computationally remains an essential task. 

The problem of protein function prediction is often formulated as that of link 
prediction, that is, predicting whether or not there exists a connection between two 
nodes in a given PPI network. While link prediction methods connect proteins on 
the basis of biological or network-based similarity, researchers report that inter- 
acting proteins are not necessarily similar and similar proteins do not necessarily 
interact (Kovacs et al, 2019). 

As indicated above, information on protein function can be provided at different 
levels of detail. There are several widely used protein function annotation schemes, 
including the Gene Ontology (Lovell et al, 2003) (GO) Consortium, the Kyoto 
Encyclopedia of Genes and Genomes (Wang and Dunbrack, 2003) (KEGG), the 
Enzyme Commission (Rhodes, 2010) (EC) numbers, the Human Phenotype 
Ontology (Robinson et al, 2008), and others. It is beyond the scope of this chapter to 
provide an explanation of these ontologies. However, we emphasize that the most popular 
one remains the GO annotation, which classifies proteins into hierarchically related 
functional classes organized into three different ontologies: Molecular Function (MF), 
Biological Process (BP), and Cellular Component (CC), to describe different aspects 
of protein functions. Systematic benchmarking efforts via the Critical Assessment 
of Functional Annotation (CAFA) community-wide experiments (Radivojac et al, 
2013; Jiang et al, 2016; Zhou et al, 2019b) and MouseFunc (Peña-Castillo et al, 
2008) have been central to the automation of protein function annotation and the 
rigorous assessment of devised methodologies. 


25.1.3 Shallow Machine Learning Models over the Years 


Many shallow machine learning approaches have been developed over the years. 
Chen and Liu (2005) propose a domain-based random forest of decision trees to infer 
protein interactions on a Saccharomyces cerevisiae dataset. 
Dohkan et al (2006) apply multiple support vector machines (SVMs) to predict 
interactions between pairs of yeast proteins and pairs of human proteins, using 
more negative pairs than positive ones. Browne et al (2007) assess 
naive Bayes (NB), multi-layer perceptron (MLP), and k-nearest neighbour (KNN) 
methods on diverse, large-scale functional data to infer pairwise (PW) and 
module-based (MB) interaction networks. PRED_PPI is an SVM-based server 
for predicting PPIs in five organisms: human, yeast, 
Drosophila, Escherichia coli, and Caenorhabditis elegans (Guo et al, 2010). Lin 
and Chen (2012) integrate features extracted from microarray expression 
measurements, GO labels, and orthologous scores, and apply a tree-augmented NB 
classifier to predict human PPIs from model organisms. You 
et al (2015a) propose a multi-scale local descriptor feature representation scheme to 
extract features from a protein sequence and classify them with a random forest. 
You et al (2015b) apply an SVM to a matrix-based representation of the protein 
sequence, which fully considers the sequence order and dipeptide information of the 
protein primary sequence, to detect PPIs. 

Although many advances were made by shallow models, as summarized in Ta- 
ble 25.1, the problem of protein function prediction is still a long way from being 
solved. Shallow machine learning methods depend greatly on feature extraction and 
feature computation, which hinder performance. The task of feature engineering, 
particularly when integrating different sources of data (sequence, expression, in- 
teractions) is complex, laborious, and ultimately limited by human creativity and 
domain-specific understanding of what may be determinants of protein function. In 
particular, feature-based shallow models cannot fully incorporate the rich, local and 
distal topological information present in one or more PPI networks. These reasons 
have prompted researchers to investigate GNNs for protein function prediction. 


25.1.4 Enter Stage Right: Graph Neural Networks 


This section first relates a general formulation of a GNN and forsakes detail in the 
interest of space, assuming readers are already somewhat familiar with GNNs. The 
rest of the section focuses on three task-specific formulations that allow leveraging 
GNNs for protein function prediction. 


25.1.4.1 Preliminaries 


Assume an undirected and unweighted molecular-interaction graph, i.e., a PPI 
network, is represented by $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, where $\mathcal{V}$ and $\mathcal{E}$ denote the set of vertices 
representing proteins and the set of edges indicating interactions among proteins, 
respectively. Let the i-th protein be represented as an m-dimensional feature vector; 
that is, $p_i \in \mathbb{R}^m$. The objective of a GNN is to learn an embedding, $h_i$, using the 
message passing protocol, which essentially aggregates and transforms neighboring 
information to update the current node’s vector representation. Assuming f and g are two 
parametric functions that compute the embedding and the output for a single 
protein, following (Scarselli et al, 2008), we formulate this as follows: 


$$h_i = f\left(p_i,\, p_{e[i]},\, p_{ne[i]},\, h_{ne[i]}\right) \tag{25.1}$$


$$o_i = g\left(h_i,\, p_i\right) \tag{25.2}$$


where $p_i$, $p_{e[i]}$, $p_{ne[i]}$, and $h_{ne[i]}$ denote the feature representation of the i-th protein, 
the features of all edges connected to the i-th protein, the neighboring proteins’ 
features, and the embeddings of the neighboring proteins of the i-th protein, respectively. 


Table 25.1: Summary of performance of shallow models as reported in (Chen and 
Liu, 2005; Guo et al, 2010; Lin and Chen, 2012; You et al, 2015a,b) 

Literature           Model                             Dataset                   Sensitivity (%)  Specificity (%)  Accuracy (%)
Chen and Liu (2005)  RF                                Saccharomyces cerevisiae  79.78            64.38            NA*
Guo et al (2010)     SVM                               Human                     89.17            92.17            90.67
                                                       Yeast                     88.17            89.81            88.99
                                                       Drosophila                99.53            80.65            90.09
                                                       Escherichia coli          95.11            90.35            92.73
                                                       Caenorhabditis elegans    96.46            98.55            97.51
Lin and Chen (2012)  Tree-Augmented Naive Bayes (TAN)  Human                     88               70               NA*
You et al (2015a)    RF                                Saccharomyces cerevisiae  94.34            NA*              94.72
You et al (2015b)    SVM                               Saccharomyces cerevisiae  85.74            94.37            90.06

* Not available 

Let us now consider $|\mathcal{V}| = n$ proteins. All proteins are represented as a matrix, 
$P \in \mathbb{R}^{n \times m}$. The adjacency matrix $A \in \mathbb{R}^{n \times n}$ encodes the connectivity of the proteins; 
namely, $A_{i,j}$ indicates whether or not there exists a link between proteins i and j. 
Enforcing a self-loop on each protein, the updated adjacency matrix is $\hat{A} = A + I$. 
The diagonal degree matrix, D, can then be defined such that $D_{i,i} = \sum_j A_{i,j}$. From 
there, one can compute the symmetric Laplacian matrix $L = D - A$. Finally, one can 
then formulate the following iterative process: 


$$H^{t+1} = F\left(H^{t},\, P \,\|\, A \,\|\, L \,\|\, X\right) \tag{25.3}$$


$$O = G\left(H,\, P \,\|\, A \,\|\, L \,\|\, X\right) \tag{25.4}$$


where $H^{t}$ denotes the t-th iteration of H, $(\cdot \,\|\, \cdot)$ indicates the aggregation operation based 
on the task at hand, and O is the final stacked output. 
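
As a small illustration of the matrices defined above, the sketch below builds A, the self-loop-augmented matrix, the degree matrix D, and the Laplacian L for a toy four-protein network, and then applies one propagation step. The toy graph, the feature sizes, and in particular the choice of F as a symmetric-normalized aggregation followed by ReLU are assumptions made for illustration; the text leaves F task-dependent.

```python
import numpy as np

# Toy PPI network with 4 proteins and edges 0-1, 1-2, 2-3 (hypothetical data).
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
n = A.shape[0]

A_hat = A + np.eye(n)          # adjacency matrix with self-loops
D = np.diag(A.sum(axis=1))     # diagonal degree matrix
L = D - A                      # symmetric graph Laplacian

def propagate(H, W):
    """One assumed instance of F: normalized neighbor aggregation, then ReLU."""
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    A_norm = d_inv_sqrt[:, None] * A_hat * d_inv_sqrt[None, :]
    return np.maximum(A_norm @ H @ W, 0.0)

rng = np.random.default_rng(0)
P = rng.normal(size=(n, 3))    # m = 3 features per protein
W = rng.normal(size=(3, 2))    # untrained weights, for illustration only
H1 = propagate(P, W)           # first iterate, starting from H^0 = P
```

Each row of L sums to zero, which is a quick sanity check that D and A were built consistently.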


25.1.4.2 GNNs for Representation Learning 


We now want to encode complex high-dimensional information, such as a protein, 
P, or a biological interaction, A, or an interaction network, $\mathcal{G}$, into low-dimensional 
embeddings, Z, by capturing linearity and non-linearity among nodes and edges. 
In principle, the representation should contain all the information for downstream 
machine learning tasks, such as link prediction, protein classification, protein cluster 
analysis, interaction prediction, etc. 

Suppose we want to learn a graph embedding, Z, from the network $\mathcal{G}$. A graph 
auto-encoder neural network (Kipf and Welling, 2016) can be applied to learn Z: 


$$Z = \mathrm{GNN}\left(P, A;\ \theta_{GNN}\right) \tag{25.5}$$


where $\theta_{GNN}$ denotes GNN (encoder)-specific learnable parameters. 


25.1.4.3 GNNs for the Link Prediction Problem 


Given two proteins, we want to predict if there is a link between them, where a 
probability $p(A_{i,j}) \approx 1$ indicates that an interaction exists with high confidence; 
conversely, $p(A_{i,j}) \approx 0$ indicates low confidence in an interaction. The prediction of a link 
between two given proteins can be set up as a binary classification problem. The 
relations among nodes can be of several types; an edge of type r from node u to 
v can be defined as $u \xrightarrow{r} v \in \mathcal{E}$, which leads to a multi-relational link 
prediction problem. 

Using GNNs, one can map graph nodes into a low-dimensional vector space 
that may preserve both local graph structure and dissimilarities among node 
features. To address link prediction, one can employ a two-layer encoder-decoder 
approach where the model learns Z from Equation 25.5: 


$$A' = \mathrm{DECODER}\left(Z \mid P, A;\ \theta_{decoder}\right) \tag{25.6}$$


where $\theta_{decoder}$ denotes decoder (task)-specific learnable parameters, and $A'_{i,j}$ 
indicates the confidence score of the predicted link between proteins i and j. 
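
The encoder-decoder pipeline of Equations 25.5 and 25.6 can be sketched as follows. The two-layer GCN encoder and the parameter-free inner-product decoder are assumptions in the spirit of the graph auto-encoder of Kipf and Welling (2016); the toy graph, feature sizes, and untrained random weights are purely illustrative.

```python
import numpy as np

def normalize_adj(A):
    """Symmetric normalization of A + I, as used by GCN-style encoders."""
    A_hat = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    return d_inv_sqrt[:, None] * A_hat * d_inv_sqrt[None, :]

def encode(P, A, W0, W1):
    """Two-layer GCN encoder: Z = A_norm ReLU(A_norm P W0) W1."""
    A_norm = normalize_adj(A)
    hidden = np.maximum(A_norm @ P @ W0, 0.0)
    return A_norm @ hidden @ W1

def decode(Z):
    """Inner-product decoder: sigmoid(z_i . z_j) for every protein pair."""
    return 1.0 / (1.0 + np.exp(-(Z @ Z.T)))

rng = np.random.default_rng(0)
A = np.array([[0, 1, 1], [1, 0, 0], [1, 0, 0]], dtype=float)  # toy PPI graph
P = rng.normal(size=(3, 5))                                   # toy protein features
W0, W1 = rng.normal(size=(5, 4)), rng.normal(size=(4, 2))     # untrained weights

Z = encode(P, A, W0, W1)
A_pred = decode(Z)  # A_pred[i, j]: confidence score for a link between i and j
```

In training, the weights would be fitted so that the predicted scores reproduce the observed edges; thresholding the scores of unobserved pairs then yields candidate interactions.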
25.1.4.4 GNNs for Automated Function Prediction as a Multi-label 
Classification Problem 


Given n GO terms and m proteins, $u = m - l$ proteins need to be annotated with 
term(s), whereas l proteins are already annotated. So for the i-th protein, the 
prediction will be $y_i = (y_{i,1}, y_{i,2}, \ldots, y_{i,n})$, where $y_{i,j} \in \{0, 1\}$. This task can be considered 
a binary multi-label classification problem, since a protein usually participates 
in multiple biological functions. The formulation can be protein-centric, where GO terms 
are annotated for each protein; GO-term-centric, where proteins are annotated for 
each GO term; or protein-term-pair-centric, where an association probability score is 
predicted for each pair. 
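
A minimal sketch of the multi-label setup described above: each protein receives one independent sigmoid output per GO term, and the training loss is computed only over the l annotated proteins. The label matrix, the random logits standing in for GNN outputs, the binary cross-entropy loss, and the 0.5 decision threshold are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def masked_bce(logits, Y, labeled):
    """Mean binary cross-entropy over the annotated proteins only."""
    probs = np.clip(sigmoid(logits[labeled]), 1e-7, 1 - 1e-7)
    targets = Y[labeled]
    loss = -(targets * np.log(probs) + (1 - targets) * np.log(1 - probs))
    return loss.mean()

# m = 4 proteins, n = 3 GO terms; the first l = 2 proteins are annotated.
Y = np.array([[1, 0, 1],
              [0, 1, 0],
              [0, 0, 0],   # placeholder rows for unannotated proteins
              [0, 0, 0]], dtype=float)
labeled = np.array([True, True, False, False])

rng = np.random.default_rng(0)
logits = rng.normal(size=Y.shape)   # stand-in for per-term GNN outputs
loss = masked_bce(logits, Y, labeled)

# Protein-centric prediction: threshold each protein's per-term probability.
y_pred = (sigmoid(logits) >= 0.5).astype(int)
```

Because the outputs are independent sigmoids rather than a single softmax, a protein can be assigned several GO terms at once.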


25.2 Highlighted Case Studies 


In the following, we highlight three selected methods that exemplify SOTA tech- 
niques and performance. 


25.2.1 Case Study 1: Prediction of Protein-Protein and 
Protein-Drug Interactions: The Link Prediction Problem 


Liu et al (2019) apply a graph convolutional neural network (GCN) to PPI prediction, 
framed as a supervised binary classification task. Learned representations of two 
proteins are fed to the model, and the model predicts the probability of interaction 
between the proteins. The model first captures position-specific information inside 
the PPI network and combines it with amino-acid sequence information to output final 
embeddings for each protein. The model encodes each amino acid as a one-hot vector 
and employs a graph convolutional layer to learn a hidden representation from the 
graph. To do that, Liu et al (2019) use the message passing protocol to update each 
protein embedding by aggregating the original features and first-hop neighbors’ 
information, which is formulated as follows: 


$$X_1 = \mathrm{ReLU}\left(D^{-1} A X_0 W_0\right) \tag{25.7}$$


where $X_0 \in \mathbb{R}^{n \times n}$ is the original protein feature matrix, which is an identity matrix; 
$X_1 \in \mathbb{R}^{n \times f}$ is the final output feature matrix, where f is the feature dimension of 
each protein after the graph convolution operation; and $W_0$ is the trainable weight 
matrix. In the prediction phase, the authors utilize fully connected layers followed 
by batch normalization and dropout layers to extract high-level features; softmax is 
then used to predict the final interaction probability score. The experiments show 
that the method achieves a mean AUPR (area under the precision-recall curve) of 0.52 
and 0.45 on yeast and human datasets, respectively, which outperforms 
sequence-based SOTA methods. Additionally, the authors report achieving 95% accuracy 
on yeast data at 93% sensitivity. These results suggest that even a single graph 
convolutional layer is capable of extracting useful 
information for the PPI prediction task. 
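
The graph convolution in Equation 25.7 can be sketched in a few lines. The toy adjacency matrix, the output dimension f = 2, the random untrained weights, and the guard against zero-degree nodes are assumptions added for illustration and are not taken from Liu et al (2019).

```python
import numpy as np

def gcn_layer(A, X0, W0):
    """Compute X1 = ReLU(D^{-1} A X0 W0) with row-normalized aggregation."""
    deg = A.sum(axis=1)
    D_inv = np.diag(1.0 / np.maximum(deg, 1.0))  # guard isolated proteins
    return np.maximum(D_inv @ A @ X0 @ W0, 0.0)

# Toy PPI graph with n = 4 proteins (hypothetical data).
A = np.array([[0, 1, 1, 0],
              [1, 0, 0, 0],
              [1, 0, 0, 1],
              [0, 0, 1, 0]], dtype=float)
n, f = A.shape[0], 2
X0 = np.eye(n)                   # identity features, as in Eq. 25.7
rng = np.random.default_rng(0)
W0 = rng.normal(size=(n, f))     # untrained weights, for illustration only

X1 = gcn_layer(A, X0, W0)        # shape (n, f)
```

With identity input features, row i of X1 is simply the ReLU of the average of the weight rows belonging to the neighbors of protein i, so each protein's embedding reflects its position in the network.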

Brockschmidt (2020) proposes a novel GNN variant using feature-wise 
linear modulation (GNN-FiLM), originally introduced by Perez et al (2018) 
in the visual question-answering domain, and evaluates it on three different 
tasks, including node-level classification on PPI networks. The targeted 
application in this work is the classification of proteins into known protein families or 
super-families, which is of great importance in numerous application domains, such 
as precision drug design. Typically, in GNN variants, information is passed from 
the source to the target node considering the learned weights and the representation 
of the source node. In contrast, GNN-FiLM uses a hypernetwork, i.e., a 
neural network that computes parameters for another network (Ha et al, 2017), in 
the graph setting, where the feature weights are learned dynamically based on the 
information that the target node holds. Therefore, considering g as a learnable 
function that computes the parameters of the affine transformation, the update rule is 
defined for the l-th layer as follows: 


$$\beta_{r,v}^{(l)},\ \gamma_{r,v}^{(l)} = g\left(h_v^{(l)};\ \theta_{g,r}\right) \tag{25.8}$$

$$h_v^{(l+1)} = \sigma\left(\sum_{u \xrightarrow{r} v \in \mathcal{E}} \gamma_{r,v}^{(l)} \odot W_r h_u^{(l)} + \beta_{r,v}^{(l)}\right) \tag{25.9}$$


where g is implemented as a single linear layer in practice, $\beta$ and 
$\gamma$ act as the hyperparameters of the message passing operation in the GNN, and $u \xrightarrow{r} v$ 
indicates that a message is passed from u to v through an edge of type r. In experiments, 
GNN-FiLM achieves a micro-averaged F1 score of 99%, which outperforms other 
variants when evaluated on protein classification tasks. 
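
The following sketch illustrates the feature-wise modulation idea behind GNN-FiLM: for every typed edge, the target node's own state is passed through a linear hypernetwork to produce a scale and a shift that modulate the incoming message. The tensor shapes, the use of ReLU as the nonlinearity, and the toy edge list are assumptions made for illustration, not the reference implementation.

```python
import numpy as np

def film_layer(H, edges, W, G):
    """One GNN-FiLM-style update.

    H:     (num_nodes, d) node states at the current layer
    edges: list of (u, r, v) triples; a message flows from u to v via type r
    W:     (num_rel, d, d) per-relation message weights
    G:     (num_rel, d, 2d) per-relation linear hypernetwork g
    """
    d = H.shape[1]
    out = np.zeros_like(H)
    for u, r, v in edges:
        film = H[v] @ G[r]                      # depends on the TARGET node v
        gamma, beta = film[:d], film[d:]        # scale and shift
        out[v] += gamma * (H[u] @ W[r]) + beta  # element-wise modulation
    return np.maximum(out, 0.0)                 # ReLU as the nonlinearity

rng = np.random.default_rng(0)
num_nodes, d, num_rel = 3, 4, 2
H = rng.normal(size=(num_nodes, d))
W = rng.normal(size=(num_rel, d, d))
G = rng.normal(size=(num_rel, d, 2 * d))
edges = [(0, 0, 1), (2, 1, 1), (1, 0, 2)]       # hypothetical typed edges

H_next = film_layer(H, edges, W, G)
```

The key design choice is that the modulation parameters are computed from the receiving node, so the same source message can be weighted differently by different targets.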

Zitnik et al (2018) employ GCNs to predict polypharmacy side effects, which 
emerge from drug-drug interactions when drug combinations are used in patients’ 
treatments. The problem can be formulated as a multi-relational link prediction 
problem on multimodal graph-structured data. Specifically, Zitnik et al (2018) 
consider two types of nodes, proteins and drugs, and construct a network from 
protein-protein, protein-drug, and drug-drug interactions, where drug-drug 
interactions represent polypharmacy side effects and each side effect is a distinct 
edge type. More precisely, a relation of type r between two nodes (proteins or drugs), 
u and v, is defined as $(u, r, v) \in \mathcal{E}$. Here, a relation can be a side effect between 
two drugs, the binding affinity of two proteins, or a relation between a protein and a 
drug. More formally, given a drug pair (u, v), the task is to predict the likelihood of 
an edge, $A_{uv} = (u, r, v)$. For this purpose, they develop a non-linear, multi-layer graph 
convolutional encoder, called Decagon, to compute the embedding of each node from the 
original node features. To update a node’s representation, the authors transform the 
information of neighboring nodes by aggregation and propagation operations over 
the edges. The update operator is defined using the following rule: 


$$h_i^{(l+1)} = \phi\left(\sum_{r} \sum_{j \in \mathcal{N}_r^i} c_r^{i,j}\, W_r^{(l)} h_j^{(l)} + c_r^{i}\, h_i^{(l)}\right) \tag{25.10}$$


where $\phi$ denotes a non-linear activation function, $h_i^{(l)}$ indicates the hidden state of 
the i-th node at the l-th layer, $W_r^{(l)}$ is a relation-type-specific learnable parameter 
matrix, $j \in \mathcal{N}_r^i$ are the neighboring nodes of i, and 
$c_r^{i,j} = 1/\sqrt{|\mathcal{N}_r^i|\,|\mathcal{N}_r^j|}$ and $c_r^{i} = 1/|\mathcal{N}_r^i|$ are the 
normalization constants. Finally, a tensor factorization model is used to predict the 
polypharmacy side effects using these embeddings. The probability of a link of type 
r between node u and v is defined as: 


$$x_r^{u,v} = \sigma\left(g(u, r, v)\right) \tag{25.11}$$


where $\sigma$ is the sigmoid function and g is defined as follows: 


$$g(u, r, v) = \begin{cases} z_u^{\top} D_r R D_r z_v & \text{if } u \text{ and } v \text{ both denote drug nodes} \\ z_u^{\top} M_r z_v & \text{if any of } u \text{ or } v \text{ is not a drug node} \end{cases} \tag{25.12}$$


where $D_r$, $R$, and $M_r$ are parameter matrices, such that $D_r$ defines a side-effect-specific 
diagonal matrix, $R$ is a global drug-drug interaction matrix, and $M_r$ is a 
relation-type-specific parameter matrix. Decagon achieves an AUPR of 83% under 80% precision, 
outperforming other baselines by up to 69%. The authors attribute the large margin 
in improvement to two components, the graph-structured convolution encoder and 
the tensor factorization model. 
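
The decoder described above can be sketched as follows, with node embeddings taken as given. The embedding dimension, the random stand-in embeddings, and the random parameter matrices are illustrative assumptions; in Decagon these would be learned jointly with the encoder.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def decagon_score(z_u, z_v, both_drugs, D_r=None, R=None, M_r=None):
    """Probability of an edge of type r between u and v (Eqs. 25.11-25.12)."""
    if both_drugs:
        g = z_u @ D_r @ R @ D_r @ z_v   # side-effect-specific rescaling of R
    else:
        g = z_u @ M_r @ z_v             # bilinear form for other relations
    return sigmoid(g)

rng = np.random.default_rng(0)
d = 4
z_drug1, z_drug2, z_protein = rng.normal(size=(3, d))  # stand-in embeddings
R = rng.normal(size=(d, d))                    # shared across all side effects
D_r = np.diag(rng.uniform(0.5, 1.5, size=d))   # diagonal, one per side effect
M_r = rng.normal(size=(d, d))                  # one per non-drug relation type

p_side_effect = decagon_score(z_drug1, z_drug2, True, D_r=D_r, R=R)
p_target = decagon_score(z_drug1, z_protein, False, M_r=M_r)
```

Sharing R across side effects while keeping only a diagonal matrix per side effect keeps the number of side-effect-specific parameters small, which matters when many side-effect types have few training examples.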


25.2.2 Case Study 2: Prediction of Protein Function and 
Functionally-important Residues 


Automated Function Prediction (AFP) problems are often formulated as multi-label 
classification problems and are more nuanced than predicting interactions 
between two proteins. Many works report that proteins connected in the same 
molecular network share the same functions (Schwikowski et al, 2000), but recent 
developments show that interacting proteins are not necessarily similar, and similar 
proteins do not necessarily interact (Kovács et al, 2019). Moreover, more than 80% 
of proteins interact with other molecules while functioning (Berggard et al, 2007). 
Therefore, identifying or predicting the roles of proteins in organisms is vital, and 
community-wide challenges have been organized to advance research towards this 
goal. These include the Critical Assessment of Function Annotation (CAFA) 
(Radivojac et al, 2013; Jiang et al, 2016; Zhou et al, 2019b) and MouseFunc 
(Peña-Castillo et al, 2008). 

Many computational methods have been developed to analyze protein-function 
relationships. Traditional machine learning approaches, such as SVMs (Guan 
et al, 2008; Wass et al, 2012; Cozzetto et al, 2016), heuristic-based methods (Schug, 
2002), high-dimensional statistical methods (Koo and Bonneau, 2018), and 
hierarchical supervised clustering methods (Das et al, 2015), have been extensively 
studied in AFP tasks. These studies found that integrating several feature types, such 
as gene and protein network or structure features, outperforms using sequence-based 
features alone. However, these traditional approaches rely strongly on 
hand-engineered features. 

Deep learning methods have become prevalent. For example, DeepSite (Jiménez 
et al, 2017), Torng and Altman (2018), and Enzynet (Amidi et al, 2018) apply 
3D convolutional neural networks (CNNs) for feature extraction and prediction 
from protein structure data. However, storing the high-resolution 3D 
representation of protein structure and applying 3D convolutions over the representation 
is inefficient (Gligorijevic et al, 2020). Very recently, GCNs (Kipf and Welling, 
2017b; Henaff et al, 2015; Bronstein et al, 2017) have been shown to generalize 
convolutional operations to graph-like molecular representations and overcome 
these limitations. 

In particular, Ioannidis et al (2019) adapt the graph residual neural network
(GRNN) approach for a semi-supervised learning task over multi-relational PPI
graphs to address AFP. The authors formulate a multi-relational connectivity graph
as an $n \times n \times I$ tensor $S$, where $S_{n,n',i}$ captures the edge between
proteins $v_n$ and $v_{n'}$ for the $i$-th relation. The $n$ proteins are encoded in a
feature matrix $X \in \mathbb{R}^{n \times f}$, where each protein is represented as an
$f \times 1$ feature vector. Furthermore, a label matrix $Y \in \mathbb{R}^{n \times k}$
encodes the $k$ labels. Subsets of proteins are associated with true labels, and the
task is to predict the labels of proteins with unavailable labels. The neighborhood
aggregation for the $n$-th protein and the $i$-th relation at the $l$-th layer is
defined by the following formula:


$$H_{n,i}^{(l)} = \sum_{n' \in \mathcal{N}_n^{(i)}} S_{n,n',i}\, Z_{n',i}^{(l-1)} \tag{25.13}$$

where $n'$ ranges over the neighboring nodes $\mathcal{N}_n^{(i)}$ of the $n$-th protein
in the $i$-th relation, and $Z_{n',i}^{(l-1)}$ denotes the feature vector of the $n'$-th
protein in the $i$-th relation at the $(l-1)$-th layer. Neighboring nodes are defined as
one-hop only, which essentially incorporates one-hop diffusion. However, successive
operations eventually spread the information across the network. To handle
multi-relational graphs, the authors combine $H_{n,i}^{(l)}$ across $i$ as follows:

$$G_{n,i}^{(l)} = \sum_{i'=1}^{I} R_{i,i'}^{(l)} H_{n,i'}^{(l)} \tag{25.14}$$

where $R_{i,i'}^{(l)}$ is the learnable parameter. Then, a linear operation mixes the
extracted features as follows:

$$Z_{n,i}^{(l)} = \left(G_{n,i}^{(l)}\right)^{\top} W_{n,i}^{(l)} \tag{25.15}$$

where $W_{n,i}^{(l)}$ is the learnable parameter. In summary, the neighborhood
convolution and propagation step can be written as:

$$Z^{(l)} = \phi(Z^{(l-1)}; \theta_z^{(l)}) \tag{25.16}$$

where $\theta_z^{(l)}$ comprises two weight matrices, $W$ and $R$, which linearly combine
the information of neighboring nodes and the multi-relational information,
respectively. Moreover, the authors incorporate a residual connection to diffuse the
input, $X$, across $L$-hop neighborhoods to capture multi-type diffusion; that is:

$$Z^{(l)} = \phi(Z^{(l-1)}; \theta_z^{(l)}) + \phi(X; \theta_x^{(l)}) \tag{25.17}$$


A softmax classification layer is used for the final prediction. The authors apply 
this model on three multi-relational networks, comprising generic, brain, and circu- 
lation cells. The model is shown to perform better than general graph convolutional 
neural networks. 
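
As a rough illustration of the aggregation, relation mixing, and residual steps described above, the following NumPy sketch implements one such layer. It is not the authors' code: the function names, the tensor layout, and the use of a single weight matrix `W` shared across proteins and relations (instead of a separate parameter per protein and relation) are simplifying assumptions, and the residual branch is assumed to apply the same kind of layer to the input features.

```python
import numpy as np


def relational_layer(S, Z_prev, R, W):
    """One multi-relational propagation step (a simplified sketch).

    S:      (n, n, I) relation tensor; S[a, b, i] is the edge a-b in relation i
    Z_prev: (n, I, p) features of every protein in every relation
    R:      (I, I)    relation-mixing weights
    W:      (p, q)    feature-mixing weights (assumed shared, see lead-in)
    """
    # Aggregate one-hop neighbours within each relation; non-neighbours
    # contribute nothing because their entries in S are zero.
    H = np.einsum("abi,bip->aip", S, Z_prev)
    # Combine the aggregated features across relations.
    G = np.einsum("ij,ajp->aip", R, H)
    # Linear mixing of the extracted features.
    return G @ W


def residual_layer(S, Z_prev, X, theta_z, theta_x):
    """Propagate the previous layer and re-inject the input features X."""
    n_relations = S.shape[2]
    X_rel = np.repeat(X[:, None, :], n_relations, axis=1)  # copy X per relation
    return relational_layer(S, Z_prev, *theta_z) + relational_layer(S, X_rel, *theta_x)


rng = np.random.default_rng(0)
n, I, f, q = 5, 3, 4, 2                      # hypothetical sizes
S = rng.random((n, n, I))
X = rng.random((n, f))
Z0 = np.repeat(X[:, None, :], I, axis=1)     # first layer starts from X
theta_z = (rng.random((I, I)), rng.random((f, q)))
theta_x = (rng.random((I, I)), rng.random((f, q)))
Z1 = residual_layer(S, Z0, X, theta_z, theta_x)
print(Z1.shape)  # (5, 3, 2)
```

Stacking several such layers spreads information beyond one hop, while the residual term keeps the original features available at every depth.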

Recently, Gligorijevic et al (2020) employ DeepFRI, based on GCNs, for functionally
annotating protein sequences and structures. DeepFRI outputs probabilities for each
function. A Long Short-Term Memory Language Model (LSTM-LM) (Graves, 2013) is
pretrained on around 10 million protein sequences from the protein family database
(Pfam) (Finn et al, 2013) to extract residue-level position-context features. The
following equation is used:

$$H^{(0)} = H^{\mathrm{input}} = \mathrm{ReLU}(H^{LM} W^{LM} + X W^{X} + b) \tag{25.18}$$

where $H^{(0)}$ is the final residue-level feature representation and the input to the
first graph convolutional layer, $H^{LM}$ denotes the LSTM-LM features, and $X$ denotes
the sequence encoding. $W^{LM}$, $W^{X}$ and $b$ are learnable parameters trained with
the graph convolutional layers. Contact-map features, which encode tertiary protein
structure, combined with LSTM-LM task-agnostic sequence embeddings are fed to a GCN
while keeping the LSTM-LM frozen. The $l$-th convolution layer takes sequence
embeddings and the contact map $A$ and outputs residue-level embeddings to the next,
$(l+1)$-th, layer. Residue-level features are extracted by propagating residue
information to proximal residues. The rule for updating the node representation is:

$$H^{(l+1)} = \mathrm{ReLU}(D^{-1/2} A D^{-1/2} H^{(l)} W^{(l)}) \tag{25.19}$$


The features are then concatenated into a single feature matrix as a protein
embedding. Intuitively, embeddings from different layers can be thought of as
context-aware features. Additionally, the feature extraction strategy exploits linear
or non-linear relationships among neighbouring residues, as well as residues distant in
sequence but proximal in structure.
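
The propagation rule and the concatenation of layer outputs can be sketched in a few lines of NumPy. This is only a minimal illustration under stated assumptions: $D$ is taken to be the diagonal degree matrix of the contact map, the contact map is assumed to contain self-contacts (so no residue has degree zero), and the function names and sizes are hypothetical rather than taken from DeepFRI.

```python
import numpy as np


def gcn_layer(A, H, W):
    """ReLU(D^{-1/2} A D^{-1/2} H W) for a contact map A with no isolated residues."""
    d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))          # assumes every degree is positive
    A_norm = d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
    return np.maximum(A_norm @ H @ W, 0.0)


def protein_embedding(A, H0, weights):
    """Apply the layers in sequence and concatenate every layer's output per residue."""
    outputs, H = [], H0
    for W in weights:
        H = gcn_layer(A, H, W)
        outputs.append(H)
    return np.concatenate(outputs, axis=1)


# Hypothetical 4-residue chain: each residue contacts itself and its sequence neighbours.
A = np.array([[1, 1, 0, 0],
              [1, 1, 1, 0],
              [0, 1, 1, 1],
              [0, 0, 1, 1]], dtype=float)
rng = np.random.default_rng(0)
H0 = rng.random((4, 3))                                 # residue-level input features
weights = [rng.random((3, 5)), rng.random((5, 5))]
embedding = protein_embedding(A, H0, weights)
print(embedding.shape)  # (4, 10)
```

Each additional layer lets a residue see contacts one step further away, which is why the concatenated outputs mix local and more distant structural context.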

The learned protein representation is fed into two consecutive fully connected
layers to obtain predictions as class probabilities for all the GO terms. The authors
evaluate their model on experimental and predicted structures and compare it with
existing baseline models, including CAFA-like BLAST (Wass et al, 2012) and the
CNN-based sequence-only DeepGOPlus (Kulmanov and Hoehndorf, 2019), on each
sub-ontology of GO terms and on EC numbers; DeepFRI outperforms the baselines in every
category.

Zhou et al (2020b) apply a GCN model, DeepGOA, to predict maize protein
functions. The authors exploit both GO structure information and protein sequence
information for a multi-label classification task. Since GO organizes the functional
annotation terms into a directed acyclic graph (DAG), the authors utilize the
knowledge encoded in the GO hierarchy. First, the amino acids of a protein are one-hot
encoded, with a 21-dimensional feature vector for each amino acid: there are 20 amino
acids, plus one dimension for the undetermined amino acids that sometimes occur in a
protein. Proteins may differ in length; therefore, the authors keep only the first 2000
amino acids of proteins that are longer than that, and zero-pad the encodings of
shorter proteins. So the $i$-th protein is represented as


$$X_i = [x_{i1}, x_{i2}, x_{i3}, \ldots, x_{i2000}] \tag{25.20}$$
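
The encoding above is straightforward to sketch. In the snippet below, the ordering of the amino-acid alphabet, the use of the last channel for undetermined residues, and the function name are assumptions made for illustration; only the 21 channels and the fixed length of 2000 come from the description.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"    # 20 standard residues (ordering is an assumption)
UNKNOWN_INDEX = len(AMINO_ACIDS)         # 21st channel for undetermined residues
MAX_LEN = 2000


def encode_protein(sequence, max_len=MAX_LEN):
    """One-hot encode a sequence, truncating long proteins and zero-padding short ones."""
    encoding = np.zeros((max_len, len(AMINO_ACIDS) + 1), dtype=np.float32)
    for pos, residue in enumerate(sequence[:max_len]):
        index = AMINO_ACIDS.find(residue)
        if index == -1:                  # anything outside the alphabet is "undetermined"
            index = UNKNOWN_INDEX
        encoding[pos, index] = 1.0
    return encoding


print(encode_protein("MKVXL").shape)  # (2000, 21)
```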


To learn a low-dimensional feature representation of each protein sequence, the
authors apply CNNs with four different convolutional kernel sizes (8, 16, 24 and 32) to
extract hypothetical non-linear secondary or tertiary structure information. The 1D
convolution operation is formulated as follows:

$$c_{im} = f(w \cdot x_{i[m:m+h]}), \quad m \in [1, k-h] \tag{25.21}$$

where $h$ is the sliding window length, $w \in \mathbb{R}^{21 \times h}$ is a
convolutional kernel, and $f(\cdot)$ is a non-linear activation function. Then, the
authors incorporate the GO structure into the model. To do that, graph convolutional
layers are deployed to generate the embeddings of the GO terms by propagating
information among GO terms using neighboring terms in the GO hierarchy. For $T$ GO
terms, an initial one-hot feature description, $H^{0} \in \mathbb{R}^{T \times T}$, and a
correlation matrix, $A \in \mathbb{R}^{T \times T}$, are computed as input. The $l$-th
layer's representation, $H^{l}$, is updated using the following neighborhood
information propagation equation:

$$H^{l} = f(\hat{A} H^{l-1} W^{l}) \tag{25.22}$$

where $\hat{A} \in \mathbb{R}^{T \times T}$ is the symmetrically normalized correlation
matrix derived from $A$, $f(\cdot)$ is a non-linear activation function, and
$W^{l} \in \mathbb{R}^{d_{l-1} \times d_{l}}$ is the learnable transformation matrix.
Then, such graph convolutional layers are stacked to capture high- and low-order
information of the GO DAG. In this way, DeepGOA learns a semantic representation of GO
terms, $H \in \mathbb{R}^{T \times d}$, and a protein sequence representation,
$Z \in \mathbb{R}^{n \times d}$, in some $d$-dimensional semantic space. The dot product
is then used to compute protein-term pair association probabilities as follows:

$$\hat{Y} = H Z^{\top} \tag{25.23}$$


A cross-entropy loss is used as the multi-label loss function to train the model
end-to-end. The authors experiment on the Maize PH207 inbred line (Hirsch et al,
2016) and the human protein sequence dataset and show that DeepGOA outperforms
state-of-the-art (SOTA) methods.
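
The GO-term branch of DeepGOA and the final dot-product scoring can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the choice of `tanh` for $f(\cdot)$, the sigmoid applied to the dot products so that they read as probabilities, the toy correlation matrix, and all function names are assumptions, and the protein embeddings `Z` are random stand-ins for the CNN output.

```python
import numpy as np


def normalize(A):
    """Symmetric normalization D^{-1/2} A D^{-1/2}; assumes every row sum is positive."""
    d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))
    return d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]


def go_term_embeddings(A, weights):
    """Stacked graph convolutions over GO terms, starting from one-hot term features."""
    A_hat = normalize(A)
    H = np.eye(A.shape[0])               # one-hot description of the T terms
    for W in weights:
        H = np.tanh(A_hat @ H @ W)       # tanh stands in for the unspecified f(.)
    return H                             # shape (T, d)


def association_scores(H, Z):
    """Dot product between term and protein embeddings, squashed to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-(H @ Z.T)))   # shape (T, n)


rng = np.random.default_rng(0)
T, n, d = 4, 3, 2                        # hypothetical numbers of terms, proteins, dims
A = np.array([[1, 1, 0, 0],
              [1, 1, 1, 1],
              [0, 1, 1, 0],
              [0, 1, 0, 1]], dtype=float)
H = go_term_embeddings(A, [rng.normal(size=(T, 8)), rng.normal(size=(8, d))])
Z = rng.normal(size=(n, d))              # stand-in for CNN protein embeddings
scores = association_scores(H, Z)
print(scores.shape)  # (4, 3)
```

Because terms and proteins live in the same $d$-dimensional space, every protein-term pair is scored with a single matrix product.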


25.2.3 Case Study 3: From Representation Learning to 
Multirelational Link Prediction in Biological Networks with 
Graph Autoencoders 


Yang et al (2020a) employ a signed variational graph auto-encoder (S-VGAE) to
automatically learn graph representations, and incorporate protein sequence
information as features for the PPI prediction task. The authors report SOTA
performance compared to existing sequence-based models on several datasets.

The protein interaction network is encoded as an undirected graph, with different
signs (i.e., positive, negative or neutral) added to the edges in the adjacency matrix
to extract fine-grained features, so that the model can learn the negative impact of
highly negative interactions. Moreover, the authors consider only high-confidence
interactions in the cost function, enabling the model to learn embeddings more
accurately. First, protein sequences are encoded using the conjoint triad (CT) method
(Shen et al, 2007). All amino acids are divided into seven categories according to
their dipoles and side-chain volumes. Amino acids within a category have similar
characteristics, so substitutions among them are treated as analogous mutations. Thus,
a protein can be represented as a sequence of numbers, each representing a category.
Then, a window of three amino acids slides over the numeric sequence one step at a
time and counts the number of occurrences of each triad. Thus, the size of a protein
CT vector is $m = 7^3 = 343$, and the vector can be defined as follows:


$$V = [r_1, r_2, \ldots, r_m] \tag{25.24}$$


where $r_i$ is the number of occurrences of the $i$-th triad type. For $n$ proteins, the
input features can be summarized in a matrix $X \in \mathbb{R}^{n \times m}$. Afterwards,
S-VGAE is employed to extract protein embeddings by combining both graph structure and
sequence information, following the variational graph auto-encoder of Kipf and Welling
(2016). Considering each protein's sequence features, neighborhood structure and
position in the graph, the encoder maps each protein $x_i$ to a low-dimensional vector
$z_i$. The idea is to map the proteins' original features $X$ into low-dimensional
embeddings $Z$ using an augmented information adjacency matrix $A$. The encoding rule
is formulated as follows:

$$q(Z \mid X, A) = \prod_{i=1}^{N} q(z_i \mid X, A) \tag{25.25}$$

$$q(z_i \mid X, A) = \mathcal{N}(z_i \mid \mu_i, \mathrm{diag}(\sigma_i^2)) \tag{25.26}$$

The mean vector, $\mu_i$, and the standard deviation vector, $\sigma_i$, are defined as
follows:

$$\mu = \mathrm{GCN}_{\mu}(X, A) \tag{25.27}$$

$$\log \sigma = \mathrm{GCN}_{\sigma}(X, A) \tag{25.28}$$

where GCN is a neighborhood aggregation propagation step formulated as below:

$$\mathrm{GCN}(X, A) = A\,\mathrm{ReLU}(A X W_0)\, W_1 \tag{25.29}$$

$$\mathrm{GCN}_{\mu}(X, A) = A\,\mathrm{ReLU}(A X W_0)\, W_{\mu} \tag{25.30}$$

$$\mathrm{GCN}_{\sigma}(X, A) = A\,\mathrm{ReLU}(A X W_0)\, W_{\sigma} \tag{25.31}$$

where $W_0$, $W_1$, $W_{\mu}$ and $W_{\sigma}$ are trainable parameters, and
$\mathrm{GCN}_{\mu}$ and $\mathrm{GCN}_{\sigma}$ share $W_0$ to reduce the number of
parameters. The decoder predicts whether proteins $i$ and $j$ interact by taking the dot
product of their lower-dimensional embeddings $z_i$ and $z_j$; the interaction
probability indicates whether there is a connection between two proteins. This is
defined as follows:

$$p(A \mid Z) = \prod_{i=1}^{N} \prod_{j=1}^{N} p(A_{ij} \mid z_i, z_j) \tag{25.32}$$

$$p(A_{ij} = 1 \mid z_i, z_j) = \sigma(z_i^{\top} z_j) \tag{25.33}$$

where $\sigma(\cdot)$ is the logistic sigmoid function. Thus, the S-VGAE learns to encode
proteins into low-dimensional embeddings by solving the task of decoding the learned
embeddings back to the original graph structure. Instead of using the decoder as the
final classification layer, the authors utilize it as a generative model for learning
latent features. Then, three fully connected layers perform the final classification
task. Overall, the model achieves more than 98% accuracy on five different datasets.
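
To make the pipeline above more concrete, the following sketch strings together a conjoint triad featurizer and a variational encoder with an inner-product decoder. It is an illustration under several assumptions, not the authors' code: the seven-class grouping is the one commonly attributed to Shen et al (2007) and should be checked against that paper, residues outside the grouping are simply skipped, edge signs and the confidence weighting of the loss are omitted, and the toy sequences, adjacency matrix, and weight shapes are hypothetical.

```python
import numpy as np

# Seven residue classes (assumed grouping, see lead-in).
CT_CLASSES = ["AGV", "ILFP", "YMTS", "HNQW", "RK", "DE", "C"]
CLASS_OF = {aa: k for k, group in enumerate(CT_CLASSES) for aa in group}


def ct_vector(sequence):
    """Count every triad of consecutive residue classes: 7**3 = 343 entries."""
    classes = [CLASS_OF[aa] for aa in sequence if aa in CLASS_OF]
    counts = np.zeros(7 ** 3)
    for a, b, c in zip(classes, classes[1:], classes[2:]):
        counts[a * 49 + b * 7 + c] += 1
    return counts


def encode(X, A, W0, W_mu, W_sigma, rng):
    """Two-layer GCN encoder with a shared first layer, then a reparameterized sample."""
    hidden = np.maximum(A @ X @ W0, 0.0)      # shared first layer
    mu = A @ hidden @ W_mu                    # mean of q(z_i | X, A)
    log_sigma = A @ hidden @ W_sigma          # log standard deviation
    return mu + np.exp(log_sigma) * rng.standard_normal(mu.shape)


def decode(Z):
    """Inner-product decoder: sigmoid(z_i . z_j) for every protein pair."""
    return 1.0 / (1.0 + np.exp(-(Z @ Z.T)))


rng = np.random.default_rng(0)
sequences = ["MKVLAAGIRKDE", "ACDEFGHIKLMN", "GGGAVVCCRKYS"]   # toy sequences
X = np.stack([ct_vector(s) for s in sequences])               # (3, 343)
A_raw = np.array([[1, 1, 0], [1, 1, 1], [0, 1, 1]], dtype=float)
d_inv_sqrt = 1.0 / np.sqrt(A_raw.sum(axis=1))
A = d_inv_sqrt[:, None] * A_raw * d_inv_sqrt[None, :]         # normalized adjacency
W0 = rng.normal(scale=0.1, size=(343, 16))
W_mu = rng.normal(scale=0.1, size=(16, 4))
W_sigma = rng.normal(scale=0.1, size=(16, 4))
Z = encode(X, A, W0, W_mu, W_sigma, rng)
P = decode(Z)
print(X.shape, Z.shape, P.shape)  # (3, 343) (3, 4) (3, 3)
```

In the setting described above, the sampled embeddings `Z` would then be passed to the fully connected classification layers rather than being used only for reconstruction.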

Hasibi and Michoel (2020) propose a graph feature auto-encoder (GFAE) model,
called FeatGraphConv, which is trained on a feature reconstruction task instead of a
graph reconstruction task. The model performs well on predicting unobserved node
features on biological networks, such as transcriptional, protein-protein and genetic
interaction networks. FeatGraphConv investigates how well GNNs might preserve
node features. The authors aim to identify whether or not the graph structure and
feature values encode similar information. The relationship between a graph $G$ and
latent embeddings $Z$ can be formulated using graph convolutional layers as a message
passing protocol that aggregates neighborhood information as follows:

$$Z = \mathrm{GCN}(G; \theta) = \mathrm{GCN}(X, A; \theta) \tag{25.34}$$

$$Z = \sigma(\hat{A}\,\mathrm{ReLU}(\hat{A} X W_0)\, W_1) \tag{25.35}$$

where $\theta$ contains the learnable weights, $\theta = \{W_0, W_1, \ldots\}$, and
$\sigma$ is a non-linear task-specific mapping function. The authors leverage four
message passing and neighborhood information aggregation operations. The GCN update
rule (Gilmer et al, 2017) is followed for the $i$-th protein's representation,
$h_i^{l}$, at the $l$-th layer as follows:

$$h_i^{l} = \sum_{j \in \mathcal{N}(i) \cup i} \frac{1}{\sqrt{\deg(i)} \cdot \sqrt{\deg(j)}}\, W h_j^{l-1} \tag{25.36}$$

The GraphSAGE (Hamilton et al, 2017b) update rule is then deployed:

$$h_i^{l} = W_1 h_i^{l-1} + W_2\, \mathrm{Mean}_{j \in \mathcal{N}(i) \cup i}\, h_j^{l-1} \tag{25.37}$$

Additionally, the authors employ the GraphConv (Morris et al, 2020b) operator:

$$h_i^{l} = W_1 h_i^{l-1} + \sum_{j \in \mathcal{N}(i)} W_2 h_j^{l-1} \tag{25.38}$$

A new update rule is also proposed:

$$h_i^{l} = W_2 \left( W_1 h_i^{l-1} \,\Vert\, \mathrm{Mean}_{j \in \mathcal{N}(i) \cup i} (W_1 h_j^{l-1}) \right) \tag{25.39}$$

where $(\cdot \Vert \cdot)$ denotes a concatenation operation. The authors train the
learnable parameters on the embeddings' ability to reconstruct the adjacency matrix,
which is formulated as follows:

$$\hat{A} = \mathrm{Sigmoid}(Z Z^{\top}) \tag{25.40}$$

Cross-entropy loss between $A$ and $\hat{A}$ and gradient descent are used to update the
weights. Finally, the embeddings $Z$ are used to predict the class $Y$, that is, to
predict missing links in the adjacency matrix and thus in the graph.
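
The four update rules differ only in how a node's own features and its neighbors' features are combined, which a small NumPy sketch makes easy to compare. The code below is an illustration, not the FeatGraphConv implementation: node features are stored as rows (so weights multiply on the right), `A` is assumed to be a binary adjacency matrix without self-loops, degrees are assumed to count the added self-loop, and all function names are hypothetical.

```python
import numpy as np


def closed_neighbourhood_mean(A, H):
    """Row i holds the mean of h_j over the neighbours of i and i itself."""
    A_self = A + np.eye(A.shape[0])
    return (A_self @ H) / A_self.sum(axis=1, keepdims=True)


def gcn_update(A, H, W):
    """Degree-normalized sum over neighbours and self."""
    A_self = A + np.eye(A.shape[0])
    deg = A_self.sum(axis=1)
    return (A_self / np.sqrt(np.outer(deg, deg))) @ H @ W


def graphsage_update(A, H, W1, W2):
    """Own features plus the mean over the closed neighbourhood."""
    return H @ W1 + closed_neighbourhood_mean(A, H) @ W2


def graphconv_update(A, H, W1, W2):
    """Own features plus the sum over neighbours."""
    return H @ W1 + (A @ H) @ W2


def featgraphconv_update(A, H, W1, W2):
    """Project, concatenate own and mean-neighbour projections, project again."""
    projected = H @ W1
    combined = np.concatenate(
        [projected, closed_neighbourhood_mean(A, projected)], axis=1)
    return combined @ W2


def reconstruct_adjacency(Z):
    """Inner-product reconstruction of the adjacency matrix."""
    return 1.0 / (1.0 + np.exp(-(Z @ Z.T)))


rng = np.random.default_rng(0)
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)   # path graph 0-1-2
H = rng.random((3, 4))
Z = featgraphconv_update(A, H, rng.random((4, 5)), rng.random((10, 2)))
A_rec = reconstruct_adjacency(Z)
print(Z.shape, A_rec.shape)  # (3, 2) (3, 3)
```

Note that the second weight matrix of the concatenation rule has twice as many input rows as the projection width, since it acts on the concatenated vector.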


25.3 Future Directions 


As this survey indicates, many variants of GNNs have been applied to obtain infor- 
mation on protein function. Much work remains to be done. Future directions can 
be broadly divided into two categories, methodology-oriented and task-oriented. 

Many existing GNN-based approaches are limited to proteins of the same size 
(number of amino acids). This essentially weakens model capacity for the particular 
task at hand. Therefore, future research needs to focus on size-agnostic, as well as 
task-agnostic models. Choosing the right model is always a difficult task. However, 
benchmark datasets and available packages are making it easier to develop models 
expediently. 

Enhancing model explainability is also an important direction. Some community
bias has been observed towards focusing model development on GCNs for learning
semantic and topological information for the function prediction task. However,
there are many other variants of GNNs. For instance, graph attention networks may
prove useful. Existing literature also often omits ablation studies, which are
important to provide a strong rationale for choosing a particular component of the
model over others.

Most PPI prediction tasks assume training a single model for one organism.
Leveraging multi-organism PPI networks provides more data and may result in
better performance. In the same spirit, leveraging multi-omics data combined with
sequence and structural data may advance the state of the art.

Finally, we draw attention to the site-specific function prediction task, which 
provides more information and highlights specific residues that are important for 
a particular function. This fine-grained function prediction task can be even more 
critical to support other tasks, such as drug design. Transfer learning across related 
tasks may additionally provide insights for learning important attributes. 

This work is supported in part by National Science Foundation Grant No. 
1907805 and Grant No. 1763233. This material is additionally based upon work 
by AS supported by (while serving at) the National Science Foundation. Any opin- 
ion, findings, and conclusions or recommendations expressed in this material are 
those of the author and do not necessarily reflect the views of the National Science 
Foundation. 


Editor's Notes: In addition to the small molecules introduced in Chapter 25,
large molecules such as proteins and DNA represent another domain in
bioinformatics that has started to leverage techniques from graph neural
networks extensively. The recent popularity of graph deep learning for small
and large molecules seems to have similar causes. The first is the
well-formulated problem and the availability of benchmark datasets, while
the other is the high complexity of the problem and the insufficiency
of existing techniques. On the other hand, there is also a subtle difference
between them: the deep graph learning community has so far devoted more
effort to new models for small molecules than for large ones. In recent years,
however, research frontiers have started to transfer the success on small
molecules to larger ones, with representative works such as AlphaFold.

Graph Neural Networks in Anomaly Detection

Abstract Anomaly detection is an important task, which tackles the problem of
discovering "different from normal" signals or patterns by analyzing a massive amount
of data, thereby identifying and preventing major faults. Anomaly detection is
applied to numerous high-impact applications in areas such as cyber-security, finance,
e-commerce, social networks, industrial monitoring, and many more mission-critical
tasks. While multiple techniques have been developed in past decades for addressing
unstructured collections of multi-dimensional data, graph-structure-aware
techniques have recently attracted considerable attention. A number of novel techniques
have been developed for anomaly detection by leveraging the graph structure.
Recently, graph neural networks (GNNs), as a powerful deep-learning-based graph
representation technique, have demonstrated superiority in leveraging the graph
structure and have been used in anomaly detection. In this chapter, we provide a
general, comprehensive, and structured overview of the existing works that apply GNNs
in anomaly detection.


26.1 Introduction 


In the era of machine learning, sometimes, what stands out in the data is more
important and interesting than the normal. This branch of tasks is called anomaly
detection, which concentrates on discovering "different from normal" signals or
patterns by analyzing a massive amount of data, thereby identifying and preventing
major faults. This task plays a key role in several high-impact domains, such as
cyber-security (network intrusion or network failure detection, malicious program
detection), finance (credit card fraud detection, malicious account detection, cashout
user detection, loan fraud detection), e-commerce (review spam detection), social
networks (key player detection, anomalous user detection, real money trading
detection), and industrial monitoring (fault detection).

In the past decades, many techniques have been developed for anomaly detection
by leveraging the graph structure, a.k.a. graph-based anomaly detection. Unlike
non-graph anomaly detection, they further take the inter-dependency among data
instances into consideration, since data instances in a wide range of disciplines,
such as physics, biology, social sciences, and information systems, are inherently
related to one another. Compared to non-graph-based methods, the performance of
graph-based methods is greatly improved. Here, we provide an illustrative
example of malicious program detection in the cyber-security domain in Figure 26.1.
In the phishing email attack shown in Figure 26.1, to steal sensitive data from the
database of a computer/server, the attacker exploits a known vulnerability of
Microsoft Office by sending a phishing email with a malicious .doc file attached to
one of the IT staff of the enterprise. When the IT staff member opens the attached
.doc file through the browser, a piece of a malicious macro is triggered. This ma- 
licious macro creates and executes a malware executable, which pretends to be an 
open-source Java runtime (Java.exe). This malware then opens a backdoor to the ad- 
versary, subsequently allowing the adversary to read and dump data from the target 
database via the affected computer. In this case, signature-based or behavior-based 
malware detection approaches generally do not work well in detecting the mali- 
cious program in our example. As the adversary can make the malicious program 
from scratch with binary obfuscation, signature-based approaches would fail due 
to the lack of known malicious signatures. Behavior-based approaches may not be 
effective unless the malware sample has previously been used to train the detection 
model. It might be possible to detect the malicious program using existing host- 
level anomaly detection techniques. These host-based anomaly detection methods 
can locally extract patterns from process events as the discriminators of abnormal 
behavior. However, such detection is based on observations of single operations, 
and it sacrifices the false positive rate to detect the malicious program. For exam- 
ple, the host-level anomaly detection can detect the fake “Java.exe” by capturing the 
database read. However, a Java-based SQL client may also exhibit the same opera- 
tion. If we simply detect the database read, we may also classify normal Java-based 
SQL clients as abnormal program instances and generate false positives. In the en- 
terprise environment, too many false positives can lead to the alert fatigue problem, 
causing cyber-analysts to fail to catch up with attacks. To accurately separate the
database read of the malicious Java from the real Java instances, we need to
consider the higher semantic-level context of the two Java instances. As shown in Figure
26.1, the malicious Java is a very simple program that directly accesses the database.
In contrast, a real Java instance has to load a set of .DLL files in addition to the
database read. By comparing the behavior graph of the fake Java instance with the
normal ones, we can determine that it is abnormal and precisely report it as a
malicious program instance. Thus, leveraging the graph helps to identify anomalous
data instances.

Figure 26.1: An illustrative example of malicious program detection in the
cyber-security domain. The left side shows an example of a phishing email attack: the
hacker creates and executes a malware executable, which pretends to be an
open-source Java runtime (Java.exe); this malware then opens a backdoor to the adversary,
subsequently allowing the adversary to read and dump data from the target database
via the affected computer. The right side demonstrates the behavior graph of the
malicious Java.exe vs. normal Java runtime.
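
The comparison of behavior graphs in this example can be illustrated with a toy sketch. Everything here is hypothetical: the behavior graphs are reduced to sets of (process, operation, target) edges, the file names are made up, and Jaccard similarity is only a stand-in for the learned graph representations discussed in the rest of this chapter.

```python
def jaccard(edges_a, edges_b):
    """Similarity of two behavior graphs represented as sets of edges."""
    union = edges_a | edges_b
    return len(edges_a & edges_b) / len(union) if union else 1.0


def anomaly_score(candidate, references):
    """One minus the mean similarity to known-normal behavior graphs."""
    return 1.0 - sum(jaccard(candidate, ref) for ref in references) / len(references)


# Hypothetical behavior graphs of normal Java runtimes.
normal_a = {("java.exe", "loads", "jvm.dll"),
            ("java.exe", "loads", "net.dll"),
            ("java.exe", "reads", "database")}
normal_b = normal_a | {("java.exe", "loads", "awt.dll")}

# The fake runtime only reads the database.
fake_java = {("java.exe", "reads", "database")}

references = [normal_a, normal_b]
print(anomaly_score(fake_java, references))   # high: shares only the database read
print(anomaly_score(normal_a, references))    # low: close to the normal graphs
```

Looking only at the database read would score both programs identically, whereas the surrounding edges separate them.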


Specifically, the benefits of graph-based methods are fourfold:


• Inter-dependent Property — Data instances in a wide range of disciplines, such
as physics, biology, social sciences, and information systems, are inherently
related to one another and can form a graph. These graph structures can provide
additional side information to identify the anomalies in addition to the attributes
of each data instance.

• Relational Property — Anomalous data instances are sometimes relational in
nature, e.g., in the fraud domain, the context of an anomalous data instance
has a high probability of being abnormal, and the anomalous data instance
is closely related to a group of data instances. If we identify one anomalous data
instance in the graph, other anomalous data instances related to it can be
detected.

• Fruitful Data Structure — The graph is a data structure encoding rich
information. The graph consists of nodes and edges, enabling the incorporation
of node and edge attributes/types for anomalous data instance identification.
Besides, multiple paths exist between each pair of data instances, which allows
relation extraction at different ranges.

• Robust Data Structure — The graph is a more adversarially robust data structure,
e.g., an attacker or fraudster usually can only manipulate a specific data
instance or its context and has a limited global view of the whole graph. In this
case, it is harder for the anomalous data instance to fit well into the graph.


Recent years have witnessed a growing interest in developing deep-learning-based
algorithms on graphs, including unsupervised methods (Grover and Leskovec,
2016; Liao et al, 2018; Perozzi et al, 2014) and supervised methods (Wang et al,
2016, 2017e; Hamilton et al, 2017b; Kipf and Welling, 2017b; Veličković et al,
2018). Among these deep-learning-based algorithms on graphs, graph neural
networks (GNNs) (Hamilton et al, 2017b; Kipf and Welling, 2017b; Veličković et al,
2018), as powerful deep graph representation learning techniques, have demonstrated
superiority in leveraging the graph structure. The basic idea is to aggregate
information from local neighborhoods in order to combine the content features and
graph structures to learn a new graph representation. In particular, GCN (Kipf
and Welling, 2017b) leverages the "graph convolution" operation to aggregate the
features of one-hop neighbors and propagates multiple-hop information via iterative
"graph convolution". GraphSAGE (Hamilton et al, 2017b) develops the graph
neural network in an inductive setting, which performs neighborhood sampling and
aggregation to generate new node representations efficiently. GAT (Veličković et al,
2018) further incorporates an attention mechanism into GCN to perform attentional
aggregation over the neighborhoods. Given the importance of graph-based anomaly
detection and the success of graph neural networks, both academia and industry
are interested in applying GNNs to tackle the problem of anomaly detection. In
recent years, some researchers have successfully applied GNNs in several important
anomaly detection tasks. In this book chapter, we summarize different GNN-based
anomaly detection approaches and provide taxonomies for them according to various
criteria. Despite more than ten papers published in the last three years,
several challenges remain unsolved, which we summarize and introduce
in this chapter as below.


• Issues. Unlike GNN applications in other domains, GNN applications in
anomaly detection have several unique issues, which come from the data, the task,
and the model. We briefly discuss and summarize them to provide a comprehensive
understanding of the difficulties of the problems.

• Pipeline. There are various GNN-based anomaly detection works. It is challenging
and time-consuming to understand the big picture of all these works. To
facilitate an easy understanding of existing research on this line, we summarize
the general pipeline of GNN-based anomaly detection approaches.

• Taxonomy. There are already several works in the domain of GNN-based anomaly
detection. Compared with other GNN applications, GNN-based anomaly detection
is more complicated due to unique challenges and problem definitions. To
provide a quick understanding of the similarities and differences between existing
works, we list some representative works and summarize novel elaborated
taxonomies according to various criteria.

• Case Studies. We provide case studies of some representative GNN-based
anomaly detection approaches.


The rest of this chapter is organized as follows. Section 26.2 discusses and
summarizes the issues of GNN-based anomaly detection. Section 26.3 provides the
unified pipeline of GNN-based anomaly detection. Section 26.4 provides the
taxonomies of existing GNN-based anomaly detection approaches. Section 26.5
provides case studies of some representative GNN-based anomaly detection
approaches. In the last section, we provide the discussion and future directions.

26.2 Issues


In this section, we provide a brief discussion and summary of the issues in GNN- 
based anomaly detection. In particular, we group them into three: (i) data-specific 
issues, (ii) task-specific issues, and (iii) model-specific issues. 


26.2.1 Data-specific Issues


Anomaly detection systems usually work with real-world data, which exhibits high
volume, high dimensionality, high heterogeneity, high complexity, and dynamic
properties.

High Volume — With the advance of information storage, it is much easier to 
collect large amounts of data. For example, in an e-commerce platform like Xianyu, 
there are over 1 billion second-hand goods published by over ten million users;
in an enterprise network monitoring system, the system event data collected from 
a single computer system in one day can easily reach 20 GB, and the number of 
events related to one specific program can easily reach thousands. It is prohibitively 
expensive to perform the analytic task on such massive data in terms of both time 
and space. 

High Dimensionality — Advances in information storage also mean that a rich
amount of information is collected, which results in high dimensionality of
the attributes for each data instance. For example, in an e-commerce platform like 
Xianyu, different types of attributes are collected for each data instance, such as 
user demographics, interests, roles, as well as different types of relations; in an en- 
terprise network monitoring system, each collected system event is associated with 
hundreds of attributes, including information of involved system entities and their 
relationships, which causes the curse of dimensionality. 

High Heterogeneity — As rich types of information are collected, it results in 
high heterogeneity of the attributes for each data instance: the feature of each data 
instance can be multi-view or multi-sourced. For example, in an e-commerce plat- 
form like Xianyu, multiple types of data are collected from the user, such as personal 
profile, purchase history, browsing history, and so on. However, multi-view data
like social relations and user attributes have different statistical properties. Such
heterogeneity makes integrating multi-view data a great challenge.

High Complexity — As we can collect more and more information, the collected 
data is complex in content: it can be categorical or numerical, which increases the 
difficulty of leveraging all the contents jointly. 

Dynamic Property — The data collection is usually conducted every day or con- 
tinuously. For example, billions of credit card transactions are performed every day; 
billions of click-through traces of web users are generated each day. This kind of 
data can be thought of as streaming data, and it demonstrates dynamic property. 

The above data-specific issues are general and apply to all kinds of data. We
therefore also need to discuss the graph-data-specific issues, including relational
property, graph heterogeneity, graph dynamic, variety of definitions, lack of intrinsic
distance/similarity metrics, and search space size.

Relational Property — The relational property of the data makes it challenging to 
quantify the anomalousness of graph objects. While in traditional outlier detection, 
the objects or data instances are treated as independent and identically distributed
(i.i.d.), the data instances in graph data are pair-wise correlated.
Thus, the "spreading activation" of anomalousness or "guilt by association" needs
to be carefully accounted for. For example, cash-out users not only have abnormal
features, but also behave abnormally in their interaction relations. They may
simultaneously have many transactions and fund transfer interactions with particular
merchants, which is hard to exploit with traditional feature extraction.

Graph Heterogeneity — Similar to the general data-specific issues of high het- 
erogeneity, the graph instance type, and relation type are usually heterogeneous. 
For example, in a computer system graph, there are three types of entities: process
(P), file (F), and INETSocket (I) and multiple types of relations: a process forking
another process (P→P), a process accessing a file (P→F), a process connecting to
an Internet socket (P→I), and so on. Due to the heterogeneity of entities (nodes)
and dependencies (edges) in a heterogeneous graph, the diversities between differ- 
ent dependencies vary dramatically, significantly increasing the difficulty of jointly 
leveraging these nodes and edges. 

Graph Dynamic — As the data are collected periodically or continuously, the 
constructed graph also demonstrates the dynamic property. It is challenging to de- 
tect the anomaly due to its dynamic nature. Some anomalous operations show some 
explicit patterns but try to hide them in a large graph, while others are with implicit 
patterns. Take an explicit anomaly pattern in a recommender system as an exam- 
ple. As anomalous users usually control multiple accounts to promote the target 
items, the edges between these accounts and items may compose a dense subgraph, 
which emerges in a short time period. In addition, although the accounts involved
in the anomaly sometimes perform anomalous operations, they behave normally most of
the time, which hides their long-term anomalous behavior
and increases the difficulty of detection.

Variety of Definitions — The definitions of anomalies in graphs are much more 
diverse than in traditional outlier detection, given the rich representation of graphs. 
For example, novel types of anomalies related to graph substructures are of interest 
for many applications, e.g., money-laundering rings in trading networks. 

Lack of Intrinsic Distance/Similarity Metrics — The intrinsic distance/similarity 
metrics are not clear. For example, in real computer systems, given two programs 
with thousands of system events related to them, it is a difficult task to measure their 
distance/similarity. 

Search Space Size — The main issue associated with more complex anomalies 
such as graph substructures is that the search space is huge, as in many graph- 
theoretical problems associated with graph search. The enumeration of possible 
substructures is combinatorial, which makes finding the anomalies a much harder
task. The search space grows even larger when the graphs are attributed, as the
possibilities span both the graph structure and the attribute space. As a result,
graph-based anomaly detection algorithms need to be designed not only for
effectiveness but also for efficiency and scalability.


26.2.2 Task-specific Issues 


Due to the unique characteristics of the anomaly detection task, further issues
arise from the task itself, including label quantity and quality, class imbalance
and asymmetric error, and novel anomalies.

Labels Quantity and Quality — The major issue of anomaly detection is that the 
data often has no or very few class labels. It is unknown which data is abnormal 
or normal. Usually, it is costly and time-consuming to obtain ground-truth labels 
from domain experts. Moreover, due to the complexity of the data, the produced
labels may be noisy and biased. This issue therefore limits the performance of
supervised machine learning algorithms. What is more, the lack of clean
ground-truth labels also makes the evaluation of anomaly detection techniques
challenging.

Class Imbalance and Asymmetric Error — Since the anomalies are rare and only 
a small fraction of the data is expected to be abnormal, the data is extremely
imbalanced. Moreover, the cost of mislabeling a good data instance versus a bad one
may vary with the application and can be hard to estimate beforehand. For example,
mispredicting a cash-out fraudster as a normal user harms the whole financial system
or even national security, while mispredicting a normal user as a fraudster could
cost customer loyalty. Therefore, class imbalance and asymmetric error seriously
affect machine-learning-based methods.
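
One common way to reflect asymmetric error costs during training is to weight the
two kinds of mistakes differently in the loss. The following is a minimal sketch in
plain Python, not a method prescribed by this chapter; the function name
`weighted_log_loss` and the cost values are illustrative assumptions.

```python
import math

def weighted_log_loss(labels, probs, cost_miss=10.0, cost_false_alarm=1.0):
    """Cost-weighted logistic loss (illustrative; the costs are assumptions).

    labels: 1 for anomaly, 0 for normal.
    probs:  predicted probability of being an anomaly.
    """
    eps = 1e-12  # keeps log() away from zero
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)
        if y == 1:
            total -= cost_miss * math.log(p)               # missed anomaly
        else:
            total -= cost_false_alarm * math.log(1.0 - p)  # false alarm
    return total / len(labels)

# 2% anomalies; a model that confidently calls everything "normal".
labels = [0] * 98 + [1] * 2
probs = [0.01] * 100
unweighted = weighted_log_loss(labels, probs, cost_miss=1.0)
weighted = weighted_log_loss(labels, probs, cost_miss=50.0)
print(unweighted < weighted)
```

With equal costs, the "everything is normal" predictor looks cheap because anomalies
are rare; raising the cost of a missed anomaly makes the same predictor expensive.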

Novel Anomalies — In some domains, such as fraud detection or malware detection,
the anomalies are created by humans. Adversaries analyze the detection system and
design anomalies that are disguised as normal instances to bypass detection. As a
result, the algorithms should not only adapt to changing and growing data over time
but also adapt to, and be able to detect, novel anomalies in the face of
adversaries.


26.2.3 Model-specific Issues 


Apart from data-specific and task-specific issues, it is also challenging to apply
graph neural networks directly to anomaly detection tasks due to their unique model
properties, such as homogeneous focus and vulnerability.

Homogeneous Focus — Most graph neural network models are designed for homogeneous
graphs, which contain a single type of node and edge. In many real-world
applications, however, data are naturally represented as heterogeneous graphs.
Traditional GNNs treat different features equally: all the features are mapped and
propagated together to obtain the node representations. Because the role of each
node is just a one-dimensional feature in the high-dimensional feature space, many
other features are unrelated to the role, e.g., age, gender, and education. Thus
the representations of applicants whose neighbors have different roles are
indistinguishable in the representation space after neighbor aggregation, which
causes traditional GNNs to fail.

Vulnerability — Recent theoretical studies have shown the limitations and
vulnerabilities of GNNs when graphs contain noisy nodes and edges. A small change
to the node features may therefore cause a dramatic performance drop, and GNN-based
fraud detectors may fail to handle camouflage, by which fraudsters sabotage them.


26.3 Pipeline 


In this section, we introduce the standard pipeline of GNN-based anomaly detection.
Typically, GNN-based anomaly detection methods consist of three important
components: graph construction and transformation, graph representation learning,
and prediction.


26.3.1 Graph Construction and Transformation 


As discussed in the previous section, a real-world anomaly detection system has
some data-specific issues, so the raw data must first be analyzed to address them.
Then the graph can be constructed to capture the complex interactions and eliminate
the data redundancies. Based on the types of the data instances and relations, the
graph can be constructed as a homogeneous graph, which has a single type of data
instance and relation, or as a heterogeneous graph, which has multiple types of data
instances and relations. Based on the availability of timestamps, the graph can be
constructed as a static graph, whose nodes and edges are fixed, or as a dynamic
graph, whose nodes and/or edges change over time. Based on the availability of node
and/or edge attributes, the constructed graph can be a plain graph, which contains
only the structure information, or an attributed graph, which has attributes on
nodes and/or edges.

When the constructed graph is heterogeneous, simply aggregating neighbors cannot
capture the semantic and structural correlations among different types of entities.
To address the graph heterogeneity issue, a graph transformation is performed to
transform the heterogeneous graph into a multi-channel graph guided by meta-paths,
where a meta-path (Sun et al, 2011) is a path that connects entity types via a
sequence of relations over a heterogeneous network. For example, in a computer
system, a meta-path can be one of the system events ($P \rightarrow P$,
$P \rightarrow F$, and $P \rightarrow I$), with each one defining a unique
relationship between two entities. The multi-channel graph is a graph with each
channel constructed via a certain type of meta-path. Formally, given a heterogeneous
graph $\mathcal{G}$ with a set of meta-paths
$\mathcal{M} = \{M_1, \ldots, M_{|\mathcal{M}|}\}$, the transformed multi-channel
network $\hat{\mathcal{G}}$ is defined as follows:

$$\hat{\mathcal{G}} = \{\mathcal{G}_i \mid \mathcal{G}_i = (\mathcal{V}_i, \mathcal{E}_i, A_i),\ i = 1, 2, \ldots, |\mathcal{M}|\} \tag{26.1}$$

where $\mathcal{E}_i$ denotes the homogeneous links between the entities in
$\mathcal{V}_i$, which are connected through the meta-path $M_i$. Each channel graph
$\mathcal{G}_i$ is associated with an adjacency matrix $A_i$, and $|\mathcal{M}|$
indicates the number of meta-paths. Notice that the potential meta-paths induced
from the heterogeneous network can be infinite, but not every one is relevant and
useful for the specific task of interest. Fortunately, some recently proposed
algorithms (Chen and Sun, 2017) automatically select the meta-paths for particular
tasks.
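
The transformation in Equation 26.1 can be illustrated for the simplest case, in
which every meta-path is a single relation type. The sketch below is an assumption
made for illustration: the helper name `to_multichannel`, the node names, and the
relation labels are hypothetical, and every channel keeps the full node set.

```python
from collections import defaultdict

def to_multichannel(nodes, typed_edges):
    """Split typed edges into one (V_i, E_i, A_i) channel per relation."""
    index = {v: k for k, v in enumerate(nodes)}
    edge_sets = defaultdict(set)
    for src, dst, relation in typed_edges:
        edge_sets[relation].add((src, dst))

    channels = {}
    for relation, edges in edge_sets.items():
        adj = [[0] * len(nodes) for _ in nodes]   # adjacency matrix A_i
        for src, dst in edges:
            adj[index[src]][index[dst]] = 1
        channels[relation] = (list(nodes), edges, adj)
    return channels

# Hypothetical system events: two processes, one file, one socket.
nodes = ["p1", "p2", "f1", "s1"]
typed_edges = [
    ("p1", "p2", "P->P"),
    ("p1", "f1", "P->F"),
    ("p2", "f1", "P->F"),
    ("p2", "s1", "P->I"),
]
channels = to_multichannel(nodes, typed_edges)
print(sorted(channels))
```

Each channel is then a homogeneous graph on which a standard GNN layer can operate.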


26.3.2 Graph Representation Learning 


After the graph is constructed and transformed, graph representation learning is
conducted to obtain a proper new representation of the graph. Generally, GNNs are
built by stacking seven types of basic operations: the neural aggregator function
$\mathrm{AGG}()$, the linear mapping function $\mathrm{MAP}_{\mathrm{linear}}()$,
the nonlinear mapping function $\mathrm{MAP}_{\mathrm{nonlinear}}()$, the multilayer
perceptron function $\mathrm{MLP}()$, feature concatenation $\mathrm{CONCAT}()$,
attentional feature fusion $\mathrm{COMB}_{\mathrm{att}}()$, and the readout
function $\mathrm{Readout}()$. Among these operations, the linear mapping function,
nonlinear mapping function, multilayer perceptron function, feature concatenation,
and attentional feature fusion are typical operations used in traditional deep
learning algorithms. They are formally described as follows:

Linear Mapping Function $\mathrm{MAP}_{\mathrm{linear}}()$:

$$\mathrm{MAP}_{\mathrm{linear}}(\mathbf{x}) = \mathbf{W}\mathbf{x} \tag{26.2}$$

where $\mathbf{x}$ is the input feature vector, and $\mathbf{W}$ is the trainable
weight matrix.

Nonlinear Mapping Function $\mathrm{MAP}_{\mathrm{nonlinear}}()$:

$$\mathrm{MAP}_{\mathrm{nonlinear}}(\mathbf{x}) = \sigma(\mathbf{W}\mathbf{x}) \tag{26.3}$$

where $\mathbf{x}$ is the input feature vector, $\mathbf{W}$ is the trainable weight
matrix, and $\sigma()$ represents the non-linear activation function.

Multilayer Perceptron Function $\mathrm{MLP}()$:

$$\mathrm{MLP}(\mathbf{x}) = \sigma(\mathbf{W}^{k} \cdots \sigma(\mathbf{W}^{1}\mathbf{x})) \tag{26.4}$$

where $\mathbf{x}$ is the input feature vector, $\mathbf{W}^{i}$ with
$i = 1, \ldots, k$ are the trainable weight matrices, $k$ indicates the number of
layers, and $\sigma()$ represents the non-linear activation function.

Feature Concatenation $\mathrm{CONCAT}()$:

$$\mathrm{CONCAT}(\mathbf{x}_1, \cdots, \mathbf{x}_n) = [\mathbf{x}_1, \cdots, \mathbf{x}_n] \tag{26.5}$$

where $n$ indicates the number of the features.

Attentional Feature Fusion $\mathrm{COMB}_{\mathrm{att}}()$:

$$\mathrm{COMB}_{\mathrm{att}}(\mathbf{x}_1, \cdots, \mathbf{x}_n) = \sum_{i=1}^{n} \mathrm{softmax}(\mathbf{x}_i)\,\mathbf{x}_i \tag{26.6}$$

$$\mathrm{softmax}(\mathbf{x}_i) = \frac{\exp(\mathrm{MAP}(\mathbf{x}_i))}{\sum_{j=1}^{n} \exp(\mathrm{MAP}(\mathbf{x}_j))} \tag{26.7}$$

where $\mathrm{MAP}()$ can be linear or nonlinear.
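
The operations in Equations 26.2 to 26.7 can be sketched in plain Python as follows.
This is an illustration under stated assumptions rather than a reference
implementation: ReLU stands in for the activation $\sigma$, biases are omitted, and
the attention score $\mathrm{MAP}(\mathbf{x})$ is taken to be a dot product with a
hypothetical scoring vector `w` so that it yields one scalar per input.

```python
import math

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def relu(v):
    return [max(0.0, t) for t in v]

def map_linear(W, x):              # Eq. 26.2
    return matvec(W, x)

def map_nonlinear(W, x):           # Eq. 26.3, with ReLU as sigma (assumption)
    return relu(matvec(W, x))

def mlp(weights, x):               # Eq. 26.4, weights = [W^1, ..., W^k]
    for W in weights:
        x = relu(matvec(W, x))
    return x

def concat(*xs):                   # Eq. 26.5
    return [t for x in xs for t in x]

def comb_att(xs, w):               # Eqs. 26.6-26.7
    """Fuse vectors xs; w is a scoring vector, so MAP(x) = w . x is a scalar."""
    scores = [sum(wi * xi for wi, xi in zip(w, x)) for x in xs]
    top = max(scores)                              # subtract max for stability
    exps = [math.exp(s - top) for s in scores]
    alphas = [e / sum(exps) for e in exps]
    dim = len(xs[0])
    return [sum(a * x[d] for a, x in zip(alphas, xs)) for d in range(dim)]

identity = [[1.0, 0.0], [0.0, 1.0]]
print(mlp([identity, identity], [1.0, -2.0]))              # [1.0, 0.0]
print(comb_att([[1.0, 0.0], [0.0, 1.0]], w=[0.0, 0.0]))    # [0.5, 0.5]
```

With an all-zero scoring vector every input receives the same attention weight, so
the fusion reduces to a plain average, which is a convenient sanity check.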

Unlike traditional deep learning algorithms, GNNs have a unique operation: the
neural aggregation function $\mathrm{AGG}()$. Based on the level of the objects to
aggregate, it can be categorized into three specific types: the node-wise neural
aggregator $\mathrm{AGG}_{\mathrm{node}}()$, the layer-wise neural aggregator
$\mathrm{AGG}_{\mathrm{layer}}()$, and the path-wise neural aggregator
$\mathrm{AGG}_{\mathrm{path}}()$.

Node-wise Neural Aggregator $\mathrm{AGG}_{\mathrm{node}}()$ is the GNN module that
aggregates the node neighborhoods, which can be described as follows,

$$\mathbf{h}_v^{(i)(k)} = \mathrm{AGG}_{\mathrm{node}}\left(\mathbf{h}_v^{(i)(k-1)}, \{\mathbf{h}_u^{(i)(k-1)}\}_{u \in \mathcal{N}_v^i}\right) \tag{26.8}$$

where $i$ is the meta-path (relation) indicator, $k \in \{1, 2, \ldots, K\}$ is the
layer indicator, $\mathbf{h}_v^{(i)(k)}$ is the feature vector of node $v$ for
relation $M_i$ at the $k$-th layer, and $\mathcal{N}_v^i$ indicates the neighborhood
of node $v$ under the relation $M_i$. Based on the way the node neighborhoods are
aggregated, the node-level neural aggregator is typically GCN
$\mathrm{AGG}^{\mathrm{GCN}}()$ (Kipf and Welling, 2017b), GAT
$\mathrm{AGG}^{\mathrm{GAT}}()$ (Veličković et al, 2018), or Message-Passing
$\mathrm{AGG}^{\mathrm{MPNN}}()$ (Gilmer et al, 2017). For GCN and GAT, the
formulation can be described by Equation 26.8, while for Message-Passing the edges
are also used during the node-level aggregation. Formally, it can be described as
follows,

$$\mathbf{h}_v^{(i)(k)} = \mathrm{AGG}^{\mathrm{MPNN}}\left(\mathbf{h}_v^{(i)(k-1)}, \{\mathbf{h}_v^{(i)(k-1)}, \mathbf{h}_u^{(i)(k-1)}, \mathbf{h}_{vu}^{(i)(k-1)}\}_{u \in \mathcal{N}_v^i}\right) \tag{26.9}$$

where $\mathbf{h}_{vu}^{(i)(k-1)}$ denotes the edge embedding between the target
node $v$ and its neighbor node $u$, and $\{\cdot\}$ indicates a fusion function that
combines the target node, its neighbor node, and the corresponding edge between
them.
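
A simplified node-wise aggregator can make Equation 26.8 concrete. The sketch below
is an assumption made for illustration: it uses a plain mean over a node and its
neighbors on a single channel, without the trainable weights or degree
normalization of an actual GCN, and the names `agg_node_mean` and `run_layers` are
hypothetical.

```python
def agg_node_mean(h, neighbors, v):
    """Average node v's vector with its neighbors' vectors (one layer, one channel)."""
    vectors = [h[v]] + [h[u] for u in neighbors[v]]
    dim = len(h[v])
    return [sum(vec[d] for vec in vectors) / len(vectors) for d in range(dim)]

def run_layers(h0, neighbors, num_layers):
    """Apply the aggregator num_layers times; each pass reaches one hop further."""
    h = dict(h0)
    for _ in range(num_layers):
        h = {v: agg_node_mean(h, neighbors, v) for v in h}
    return h

# Path graph a - b - c: node a only "sees" c after two aggregation passes.
neighbors = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
h0 = {"a": [0.0], "b": [0.0], "c": [1.0]}
h1 = run_layers(h0, neighbors, 1)
h2 = run_layers(h0, neighbors, 2)
print(h1["a"], h2["a"])
```

After one pass the signal placed on `c` has not reached `a`; after two passes it
has, which is why stacking aggregation layers widens the receptive field.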

Layer-wise Neural Aggregator $\mathrm{AGG}_{\mathrm{layer}}()$ is the GNN module
that aggregates the context information from different hops. For example, if the
layer number is $k = 2$, the GNN gets 1-hop neighborhood information, and if the
layer number is $k = K + 1$, the GNN gets $K$-hop neighborhood information. The
larger $k$ is, the more global information the GNN obtains. Formally, this function
can be described as follows,

$$\mathbf{l}_v^{(i)(k)} = \mathrm{AGG}_{\mathrm{layer}}\left(\mathbf{l}_v^{(i)(k-1)}, \mathbf{h}_v^{(i)(k)}\right) \tag{26.10}$$

where $\mathbf{l}_v^{(i)(k)}$ is the aggregated representation of the $(k-1)$-hop
neighborhood of node $v$ for relation $M_i$ at the $k$-th layer.

Path-wise Neural Aggregator $\mathrm{AGG}_{\mathrm{path}}()$ is the GNN module that
aggregates the context information from different relations. Generally, the
relations can be described by meta-path (Sun et al, 2011) based contextual search.
Formally, this function can be described as follows,

$$\mathbf{p}_v^{(i)} = \mathbf{l}_v^{(i)(K)} \tag{26.11}$$

$$\mathbf{p}_v = \mathrm{AGG}_{\mathrm{path}}\left(\mathbf{p}_v^{(1)}, \ldots, \mathbf{p}_v^{(|\mathcal{M}|)}\right) \tag{26.12}$$

where $\mathbf{p}_v^{(i)}$ is the aggregated final-layer representation of node $v$
for relation $M_i$. Then the final node representation is the fused representation
from different meta-paths (relations), as follows,

$$\mathbf{h}_v^{(final)} = \mathbf{p}_v \tag{26.13}$$

Based on the task, we can also compute the graph representation by applying the
readout function $\mathrm{Readout}()$ to aggregate all the nodes' final
representations, which can be described as follows,

$$\mathbf{g} = \mathrm{Readout}\left(\mathbf{h}_{v_1}^{(final)}, \mathbf{h}_{v_2}^{(final)}, \ldots, \mathbf{h}_{v_{|\mathcal{V}|}}^{(final)}\right) \tag{26.14}$$


Typically, we can obtain different levels of graph representations, including
node-level, edge-level, and graph-level. The node-level and edge-level
representations are the most basic representations, which can be learned via the
graph neural network. The graph-level representation is a higher-level
representation, which can be obtained by applying the readout function to the
node-level and edge-level representations. Based on the target of the task, the
representation at the specific level is fed to the next stage.
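
The readout step that turns node-level representations into a graph-level one can
be as simple as a pooling operation. The sketch below assumes mean or max pooling;
the chapter does not fix a particular readout, so the function name `readout` and
both pooling choices are illustrative.

```python
def readout(node_vectors, mode="mean"):
    """Pool final node representations into one graph-level vector."""
    dim = len(node_vectors[0])
    columns = [[vec[d] for vec in node_vectors] for d in range(dim)]
    if mode == "mean":
        return [sum(col) / len(col) for col in columns]
    if mode == "max":
        return [max(col) for col in columns]
    raise ValueError("unknown readout mode: " + mode)

final_nodes = [[1.0, 4.0], [3.0, 0.0]]
print(readout(final_nodes, "mean"))  # [2.0, 2.0]
print(readout(final_nodes, "max"))   # [3.0, 4.0]
```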


26.3.3 Prediction 


After the graph representation is learned, it is fed to the prediction stage.
Depending on the task and the target label, there are two types of prediction:
classification and matching. Classification-based prediction assumes that enough
labeled anomalous data instances are provided, so that a good classifier can be
trained to identify whether the given graph target is abnormal or not. As mentioned
in the issues section, there might be no or few anomalous data instances. In this
case, matching-based prediction is usually used. If there are very few anomaly
samples, we learn their representations, and when the candidate sample is similar to
one of the anomaly samples, an alarm is triggered. If there is no anomaly sample, we
learn the representations of the normal data instances, and when the candidate
sample is not similar to any of the normal samples, an alarm is triggered.
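
The two matching rules can be sketched as follows. This is a minimal illustration
that assumes cosine similarity between learned representations and a fixed
threshold; the function names and the threshold value of 0.9 are hypothetical
choices, not part of any specific method surveyed here.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def alarm_from_anomalies(candidate, anomaly_refs, threshold=0.9):
    """Few known anomalies: alarm if the candidate resembles any of them."""
    return any(cosine(candidate, ref) >= threshold for ref in anomaly_refs)

def alarm_from_normals(candidate, normal_refs, threshold=0.9):
    """No known anomalies: alarm if the candidate resembles no normal sample."""
    return all(cosine(candidate, ref) < threshold for ref in normal_refs)

normal_refs = [[1.0, 0.0], [0.9, 0.1]]
print(alarm_from_normals([1.0, 0.05], normal_refs))  # False
print(alarm_from_normals([0.0, 1.0], normal_refs))   # True
```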


26.4 Taxonomy 


In this section, we provide the taxonomies of existing GNN-based anomaly detection
approaches. Due to the variety of graph data and anomalies, GNN-based anomaly
detection can have multiple taxonomies. Here we provide six types of taxonomy to
give a quick understanding of the similarities and differences between existing
works: task taxonomy, anomaly taxonomy, static/dynamic graph taxonomy,
homogeneous/heterogeneous graph taxonomy, plain/attributed graph taxonomy, and
objective taxonomy.

In task taxonomy, the existing works can be categorized into GNN-based anomaly
detection in financial networks, computer networks, telecom networks, social
networks, opinion networks, and sensor networks.

In anomaly taxonomy, the existing works can be categorized into node-level anomaly
detection, edge-level anomaly detection, and graph-level anomaly detection.

In static/dynamic graph taxonomy, the existing works can be categorized into static
GNN-based anomaly detection and dynamic GNN-based anomaly detection.

In homogeneous/heterogeneous graph taxonomy, the existing works can be categorized
into homogeneous GNN-based anomaly detection and heterogeneous GNN-based anomaly
detection.

In plain/attributed graph taxonomy, the existing works can be categorized into plain
GNN-based anomaly detection and attributed GNN-based anomaly detection.

In objective taxonomy, the existing works can be categorized into
classification-based approaches and matching-based approaches.

We present our taxonomy in more detail in Table 26.1.


26.5 Case Studies 


In this section, we provide the case studies to give the details of some representative 
GNN-based anomaly detection approaches.

Table 26.1: Summary of GNN-based anomaly detection approaches.

| Approach | Year | Venue | Task | Anomaly | Static/Dynamic | Homogeneous/Heterogeneous | Plain/Attributed | Model | Objective |
|---|---|---|---|---|---|---|---|---|---|
| GEM (Liu et al, 2018f) | 2018 | CIKM | Malicious Account Detection | Node | Static | Heterogeneous | Attributed | GCN, Attention (path) | Classification |
| HACUD (Hu et al, 2019b) | 2019 | AAAI | Cashout User Detection | Node | Static | Heterogeneous | Attributed | GCN, Attention (feature, path) | Classification |
| DeepHGNN (Wang et al, 2019h) | 2019 | SDM | Malicious Program Detection | Node | Static | Heterogeneous | Attributed | GCN, Attention (path) | Classification |
| MatchGNet (Wang et al, 2019i) | 2019 | IJCAI | Malicious Program Detection | Graph | Static | Heterogeneous | Attributed | GCN, Attention (node, layer, path) | Matching |
| AddGraph (Zheng et al, 2019) | 2019 | IJCAI | Malicious Connection Detection | Edge | Dynamic | Homogeneous | Plain | GCN, GRU, Attention | Matching |
| SemiGNN (Wang et al, 2019b) | 2019 | ICDM | Malicious Account Detection | Node | Static | Heterogeneous | Attributed | GCN, Attention (node, path) | Classification, Matching |
| MVAN (Tao et al, 2019) | 2019 | KDD | Real Money Trading Detection | Node | Static | Heterogeneous | Attributed | GAT, Attention (path, view) | Classification |
| GAS (Li et al, 2019a) | 2019 | CIKM | Spam Review Detection | Edge | Static | Heterogeneous | Attributed | MPNN, Attention (message) | Classification |
| iDetective (Zhang et al, 2019a) | 2019 | CIKM | Key Player Detection | Node | Static | Heterogeneous | Attributed | GCN, Attention (path) | Classification |
| GAL (Zhao et al, 2020) | 2020 | CIKM | Anomaly User Detection | Node | Static | Homogeneous | Attributed | GCN/GAT | Matching |
| CARE-GNN (Dou et al, 2020) | 2020 | CIKM | Fraud Detection | Node | Static | Heterogeneous | Attributed | GCN, Attention (node) | Classification |


26.5.1 Case Study 1: Graph Embeddings for Malicious Accounts 
Detection 


Graph embeddings for malicious accounts detection (GEM) (Liu et al, 2018f) is the
first attempt to apply GNNs to anomaly detection. The aim of GEM is to detect
malicious accounts at Alipay, a mobile cashless payment platform.

The graph constructed from the raw data is static and heterogeneous. The constructed
graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ consists of 7 types of nodes,
including account-typed nodes ($U$) and 6 types of device-typed nodes: phone number
($PN$), User Machine ID ($UMID$), MAC address ($MACA$), International Mobile
Subscriber Identity ($IMSI$), Alipay Device ID ($APDID$), and a random number
generated via IMSI and IMEI ($TID$), such that
$\mathcal{V} = U \cup PN \cup UMID \cup MACA \cup IMSI \cup APDID \cup TID$. To
overcome the heterogeneous graph challenge and make GNNs applicable to the graph,
GEM constructs, through graph transformation, a 6-channel graph
$\hat{\mathcal{G}} = \{\mathcal{G}_i \mid \mathcal{G}_i = (\mathcal{V}_i, \mathcal{E}_i, A_i),\ i = 1, 2, \ldots, |\mathcal{M}|\}$
with $|\mathcal{M}| = 6$. In particular, 6 types of edges are specifically modeled
to capture the edge heterogeneity: account connects phone number
($U \rightarrow PN$), account connects UMID ($U \rightarrow UMID$), account connects
MAC address ($U \rightarrow MACA$), account connects IMSI ($U \rightarrow IMSI$),
account connects Alipay Device ID ($U \rightarrow APDID$), and account connects TID
($U \rightarrow TID$). As activity attributes are constructed for the nodes, the
constructed graph is an attributed graph. After the graphs are constructed and
transformed, GEM applies a graph convolutional network to aggregate the neighborhood
on each channel graph. As each channel graph is treated as a homogeneous graph
corresponding to a specific relation, the GNN can be directly applied to each
channel graph.

During the graph representation learning stage, the node aggregated representation
$\mathbf{h}_v^{(i)(k)}$ is computed by a GCN aggregator
$\mathrm{AGG}^{\mathrm{GCN}}()$. To get the path aggregated representation, GEM
adopts attentional feature fusion to fuse the node aggregated representations
obtained in each channel graph $\mathcal{G}_i$. Besides, an activity feature is
constructed for each node, and the linear mapping of this activity feature is added
to the attentional feature fusion of the path aggregated representations. Formally,
the GNN operations can be described as follows.

Node-wise aggregation:

$$\begin{aligned}
\mathbf{h}_v^{(i)(k)} &= \mathrm{AGG}_{\mathrm{node}}\left(\mathbf{h}_v^{(i)(k-1)}, \{\mathbf{h}_u^{(i)(k-1)}\}_{u \in \mathcal{N}_v^i}\right) \\
&= \mathrm{AGG}^{\mathrm{GCN}}\left(\mathbf{h}_v^{(i)(k-1)}, \{\mathbf{h}_u^{(i)(k-1)}\}_{u \in \mathcal{N}_v^i}\right)
\end{aligned} \tag{26.15}$$

Path-wise aggregation:

$$\mathbf{p}_v^{(k)} = \mathrm{MAP}_{\mathrm{linear}}(\mathbf{x}_v) + \mathrm{COMB}_{\mathrm{att}}\left(\mathbf{h}_v^{(1)(k)}, \ldots, \mathbf{h}_v^{(|\mathcal{M}|)(k)}\right) \tag{26.16}$$

Layer-wise aggregation:

$$\mathbf{l}_v^{(K)} = \mathbf{p}_v^{(K)} \tag{26.17}$$

Final node representation:

$$\mathbf{h}_v^{(final)} = \mathbf{l}_v^{(K)} \tag{26.18}$$

where $K$ indicates the number of layers.

The objective of GEM is classification. It feeds the learned account node embedding
to a standard logistic loss function.


26.5.2 Case Study 2: Hierarchical Attention Mechanism based 
Cash-out User Detection 


Hierarchical attention mechanism based cash-out user detection (HACUD) (Hu
et al, 2019b) applies GNNs to fraudulent user detection on a Credit Payment
Services platform, where the fraudulent users commit cash-out fraud, i.e., they
pursue cash gains with illegal or insincere intent.

HACUD also constructs a static heterogeneous graph from the raw data. Specifically,
the graph consists of multiple types of nodes (i.e., User ($U$), Merchant ($M$),
Device ($D$)) with rich attributes and relations (i.e., the fund transfer relation
between users and the transaction relation between users and merchants). Unlike GEM,
during the graph transformation stage HACUD deals with the graph heterogeneity issue
by modeling only the user nodes and considering two specific types of meta-paths
(relations) between pairs of users, User-(fund transfer)-User (UU) and
User-(transaction)-Merchant-(transaction)-User (UMU), and constructs a 2-channel
graph, such that
$\hat{\mathcal{G}} = \{\mathcal{G}_i \mid \mathcal{G}_i = (\mathcal{V}_i, \mathcal{E}_i, A_i),\ i = 1, \ldots, |\mathcal{M}|\}$
with $|\mathcal{M}| = 2$ and $\mathcal{V}_i \subseteq U$. The two selected
meta-paths capture different semantics. For example, the UU path connects users
having fund transfers from one to another, while the UMU path connects users having
transactions with the same merchants. Each channel graph is then homogeneous and can
work with a GNN directly. As the user attributes are available, the constructed
graph is attributed.

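In the graph representation stage, node-wise aggregation is performed on each
channel graph via a graph convolutional network. Unlike GEM (Liu et al, 2018f),
HACUD joins the user feature $\mathbf{x}_v$ to the aggregated node representation in
an attentional way. The node-wise aggregation thus extends to a 3-step procedure:
(a) initial node-wise aggregation, (b) feature fusion, and (c) feature attention.
After the initial aggregated node representation $\tilde{\mathbf{h}}_v^{(i)}$ is
computed via GCN $\mathrm{AGG}^{\mathrm{GCN}}()$, it is fused with the user feature
$\mathbf{x}_v$ through a feature fusion. Next, the feature attention is performed.
Since only 1-hop neighborhoods are considered, there is no layer-wise aggregation,
and the final node-wise aggregated representations $\mathbf{h}_v^{(i)}$ are fed to
the path-wise aggregation directly. Formally, it can be described as follows,

Node-wise aggregation:

(a) Initial node-wise aggregation:

$$\begin{aligned}
\tilde{\mathbf{h}}_v^{(i)} &= \mathrm{AGG}_{\mathrm{node}}\left(\mathbf{h}_v^{(i)}, \{\mathbf{h}_u^{(i)}\}_{u \in \mathcal{N}_v^i}\right) \\
&= \mathrm{AGG}^{\mathrm{GCN}}\left(\mathbf{h}_v^{(i)}, \{\mathbf{h}_u^{(i)}\}_{u \in \mathcal{N}_v^i}\right)
\end{aligned} \tag{26.19}$$

(b) Feature fusion:

$$\mathbf{f}_v^{(i)} = \mathrm{MAP}_{\mathrm{nonlinear}}\left(\mathrm{CONCAT}\left(\mathrm{MAP}_{\mathrm{linear}}(\tilde{\mathbf{h}}_v^{(i)}), \mathrm{MAP}_{\mathrm{linear}}(\mathbf{x}_v)\right)\right) \tag{26.20}$$

(c) Feature attention:

$$\boldsymbol{\alpha}_v^{(i)} = \mathrm{MAP}_{\mathrm{nonlinear}}\left(\mathrm{MAP}_{\mathrm{nonlinear}}\left(\mathrm{CONCAT}\left(\mathrm{MAP}_{\mathrm{linear}}(\mathbf{x}_v), \mathbf{f}_v^{(i)}\right)\right)\right) \tag{26.21}$$

$$\mathbf{h}_v^{(i)} = \mathrm{softmax}(\boldsymbol{\alpha}_v^{(i)}) \odot \mathbf{f}_v^{(i)} \tag{26.22}$$

Path-wise aggregation:

$$\begin{aligned}
\mathbf{p}_v &= \mathrm{AGG}_{\mathrm{path}}\left(\mathbf{h}_v^{(1)}, \ldots, \mathbf{h}_v^{(|\mathcal{M}|)}\right) \\
&= \mathrm{COMB}_{\mathrm{att}}\left(\mathbf{h}_v^{(1)}, \ldots, \mathbf{h}_v^{(|\mathcal{M}|)}\right)
\end{aligned} \tag{26.23}$$

Final node representation:

$$\mathbf{h}_v^{(final)} = \mathrm{MLP}(\mathbf{p}_v) \tag{26.24}$$

where $\odot$ denotes the element-wise product. As only one-hop information is used,
there is no layer indicator $k$.

As with GEM, the objective of HACUD is classification. It feeds the learned user
node embedding to a standard logistic loss function.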
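
The feature fusion and feature attention steps (Equations 26.20 to 26.22) can be
sketched for a single node and channel as below. This is an illustration under
assumptions, not HACUD's implementation: ReLU stands in for the nonlinearity, biases
are omitted, the same hypothetical matrix `W_x` maps the user feature in both steps,
and all weight values are made up so that the result can be checked by hand.

```python
import math

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def relu(v):
    return [max(0.0, t) for t in v]

def softmax(v):
    top = max(v)
    exps = [math.exp(t - top) for t in v]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_and_attend(h_agg, x_v, W_h, W_x, W_f, W_a1, W_a2):
    """Feature fusion followed by feature attention for one node and channel."""
    # (b) fuse the aggregated neighborhood vector with the user's own features;
    #     list "+" concatenates the two mapped vectors
    f = relu(matvec(W_f, matvec(W_h, h_agg) + matvec(W_x, x_v)))
    # (c) two nonlinear maps give one attention score per fused dimension
    a = relu(matvec(W_a2, relu(matvec(W_a1, matvec(W_x, x_v) + f))))
    weights = softmax(a)
    return [w * t for w, t in zip(weights, f)]

I2 = [[1.0, 0.0], [0.0, 1.0]]
W_f = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]]   # adds the two halves
W_a1 = [[0.0] * 4, [0.0] * 4]                        # zero scores: uniform attention
out = fuse_and_attend([1.0, 2.0], [3.0, 4.0], I2, I2, W_f, W_a1, I2)
print(out)  # [2.0, 3.0]
```

With zero attention scores the softmax is uniform, so each fused dimension is simply
halved; trained weights would instead emphasize the more informative dimensions.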


26.5.3 Case Study 3: Attentional Heterogeneous Graph Neural 
Networks for Malicious Program Detection 


Attentional heterogeneous graph neural network for malicious program detection
(DeepHGNN) (Wang et al, 2019h) applies GNNs to malicious program detection in a
computer system of an enterprise network.

The raw data is a large volume of system behavioral data with rich information on
program/process-level events. A static heterogeneous graph is constructed to model
the program behaviors. Formally, given the program event data across many machines
within a time window (e.g., 1 day), a heterogeneous graph
$\mathcal{G} = (\mathcal{V}, \mathcal{E})$ is constructed for the target program.
$\mathcal{V}$ denotes a set of nodes, each representing an entity of one of three
types: process ($P$), file ($F$), and INETSocket ($I$). Namely,
$\mathcal{V} = P \cup F \cup I$. $\mathcal{E}$ denotes a set of edges
$(v_s, v_d, r)$ between the source entity $v_s$ and the destination entity $v_d$
with relation $r$. To address the heterogeneous graph challenges, DeepHGNN considers
three types of relations: (1) a process forking another process
($P \rightarrow P$), (2) a process accessing a file ($P \rightarrow F$), and (3) a
process connecting to an Internet socket ($P \rightarrow I$). Similar to GEM,
DeepHGNN designs a graph transformation module to transform the heterogeneous graph
into a 3-channel graph guided by the above three meta-paths (relations), such that
$\hat{\mathcal{G}} = \{\mathcal{G}_i \mid \mathcal{G}_i = (\mathcal{V}_i, \mathcal{E}_i, A_i),\ i = 1, 2, \ldots, |\mathcal{M}|\}$
with $|\mathcal{M}| = 3$ and $\mathcal{V}_i = \mathcal{V}$. Attributes are
constructed for each node. Since the process, file, and INETSocket nodes have quite
different attributes, the graph statistic features $\mathbf{x}_v^{(i)(gstat)}$ are
constructed and act as the node attributes.
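
As an illustration of type-independent node attributes, the sketch below computes
in-degree and out-degree per node on one channel graph. Which statistics DeepHGNN
actually uses is not stated here, so degree counts, the function name
`degree_features`, and the example edges are assumptions.

```python
def degree_features(nodes, edges):
    """Return {node: [in_degree, out_degree]} for one channel graph."""
    stats = {v: [0, 0] for v in nodes}
    for src, dst in edges:
        stats[src][1] += 1   # outgoing edge
        stats[dst][0] += 1   # incoming edge
    return stats

# Hypothetical P->F channel: two processes access the same file.
nodes = ["p1", "p2", "f1"]
edges = [("p1", "f1"), ("p2", "f1")]
features = degree_features(nodes, edges)
print(features["f1"])  # [2, 0]
print(features["p1"])  # [0, 1]
```

Such statistics are defined for every node regardless of its type, which is what
lets process, file, and socket nodes share one feature space.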

Similar to GEM and HACUD, DeepHGNN also adopts the graph convolutional network
$\mathrm{AGG}^{\mathrm{GCN}}()$ for node-wise aggregation. Three layers are used in
order to capture program behavior within 3-hop contexts. Unlike GEM and HACUD,
DeepHGNN uses the graph statistic node attributes as the initialization of the node
representation for each channel graph. After the three layers of node-wise and
layer-wise aggregation, the node representations from different channel graphs are
fused via attentional feature fusion, as in GEM and HACUD. Formally, it can be
described as follows,

Node-wise aggregation:

$$\mathbf{h}_v^{(i)(0)} = \mathbf{x}_v^{(i)(gstat)} \tag{26.25}$$

$$\begin{aligned}
\mathbf{h}_v^{(i)(k)} &= \mathrm{AGG}_{\mathrm{node}}\left(\mathbf{h}_v^{(i)(k-1)}, \{\mathbf{h}_u^{(i)(k-1)}\}_{u \in \mathcal{N}_v^i}\right) \\
&= \mathrm{AGG}^{\mathrm{GCN}}\left(\mathbf{h}_v^{(i)(k-1)}, \{\mathbf{h}_u^{(i)(k-1)}\}_{u \in \mathcal{N}_v^i}\right)
\end{aligned} \tag{26.26}$$

Layer-wise aggregation:

$$\mathbf{l}_v^{(i)(K)} = \mathbf{h}_v^{(i)(K)} \tag{26.27}$$

Path-wise aggregation:

$$\mathbf{p}_v = \mathrm{COMB}_{\mathrm{att}}\left(\mathbf{l}_v^{(1)(K)}, \ldots, \mathbf{l}_v^{(|\mathcal{M}|)(K)}\right) \tag{26.28}$$

Final node representation:

$$\mathbf{h}_v^{(final)} = \mathbf{p}_v \tag{26.29}$$

The objective of DeepHGNN is classification. However, unlike GEM and HACUD, which
simply build a single classifier for all the samples, DeepHGNN formulates malicious
program detection as a program reidentification problem. The graph representation
learning aims to learn the representation of the normal target program, and a unique
classifier is learned for each target program. Given a target program with
corresponding event data during a time window $U = \{e_1, e_2, \ldots\}$ and a
claimed name/ID, the system checks whether the program belongs to the claimed
name/ID. If it matches the behavior pattern of the claimed name/ID, the predicted
label should be $+1$; otherwise, it should be $-1$.


26.5.4 Case Study 4: Graph Matching Framework to Learn the 
Program Representation and Similarity Metric via Graph 
Neural Networks for Unknown Malicious Program 
Detection 


Graph matching framework to learn the program representation and similarity metric
via graph neural network (MatchGNet) (Wang et al, 2019i) is another GNN-based
anomaly detection approach for malicious program detection in a computer system of
an enterprise network. MatchGNet differs from DeepHGNN in five aspects: (1) after
the graph transformation, the resulting channel graphs keep only the target type of
node, the process node, which is similar to HACUD; (2) the raw program attributes
are used to initialize the program node representations; (3) the GNN aggregation is
conducted hierarchically at the node, layer, and path levels; (4) the anomaly target
is the subgraph of the target program; and (5) the final graph representation is fed
to a similarity learning framework with contrastive loss to deal with unknown
anomalies.

MatchGNet constructs the static heterogeneous graph from system behavioral data in
a similar way. In the graph transformation, it adopts three meta-paths (relations):
a process forking another process ($P \rightarrow P$), two processes accessing the
same file ($P \leftarrow F \rightarrow P$), and two processes opening the same
internet socket ($P \leftarrow I \rightarrow P$), with each one defining a unique
relationship between two processes. Based on them, a 3-channel graph is constructed
from the heterogeneous graph, such that
$\hat{\mathcal{G}} = \{\mathcal{G}_i \mid \mathcal{G}_i = (\mathcal{V}_i, \mathcal{E}_i, A_i),\ i = 1, \ldots, |\mathcal{M}|\}$
with $|\mathcal{M}| = 3$ and $\mathcal{V}_i \subseteq P$. Then the GNN can be
directly applied to each channel graph. As only process-typed nodes are available,
the raw attributes $\mathbf{x}_v$ of these processes are used as the node
representation initialization.

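
A two-hop meta-path such as "two processes accessing the same file" can be turned
into process-to-process links by grouping accesses per resource. The sketch below is
a minimal illustration; the helper name `shared_resource_channel` and the example
events are hypothetical, and the links are treated as undirected.

```python
from itertools import combinations

def shared_resource_channel(access_edges):
    """Link two processes whenever they touched the same resource.

    access_edges: iterable of (process, resource) pairs, e.g. file accesses.
    Returns a set of undirected process pairs (sorted tuples).
    """
    by_resource = {}
    for process, resource in access_edges:
        by_resource.setdefault(resource, set()).add(process)
    links = set()
    for processes in by_resource.values():
        links.update(combinations(sorted(processes), 2))
    return links

file_access = [("p1", "f1"), ("p2", "f1"), ("p3", "f2"), ("p1", "f2")]
print(sorted(shared_resource_channel(file_access)))
# [('p1', 'p2'), ('p1', 'p3')]
```

The resulting channel contains only process nodes, which matches the design choice
of keeping a single target node type after the transformation.
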
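During the graph representation stage, a hierarchical attentional graph neural
network is designed, including a node-wise attentional neural aggregator, a
layer-wise dense-connected neural aggregator, and a path-wise attentional neural
aggregator. In particular, the node-wise attentional neural aggregator generates
node embeddings by selectively aggregating the entities in each channel graph based
on random walk scores $\alpha_{(v:u)}^{(i)}$. The layer-wise dense-connected neural
aggregator aggregates the node embeddings generated from different layers into a
dense-connected node embedding. The path-wise attentional neural aggregator performs
attentional feature fusion of the layer-wise dense-connected representations. In the
end, the final node representation is used as the graph representation. Formally, it
can be described as follows,

Node-wise aggregation:

$$\mathbf{h}_v^{(i)(0)} = \mathbf{x}_v \tag{26.30}$$

$$\begin{aligned}
\mathbf{h}_v^{(i)(k)} &= \mathrm{AGG}_{\mathrm{node}}\left(\mathbf{h}_v^{(i)(k-1)}, \{\mathbf{h}_u^{(i)(k-1)}\}_{u \in \mathcal{N}_v^i}\right) \\
&= \mathrm{MLP}\left((1 + \epsilon^{(k)})\,\mathbf{h}_v^{(i)(k-1)} + \sum_{u \in \mathcal{N}_v^i} \alpha_{(v:u)}^{(i)}\,\mathbf{h}_u^{(i)(k-1)}\right)
\end{aligned} \tag{26.31}$$

Layer-wise aggregation:

$$\begin{aligned}
\mathbf{l}_v^{(i)(k)} &= \mathrm{AGG}_{\mathrm{layer}}\left(\mathbf{h}_v^{(i)(0)}, \mathbf{h}_v^{(i)(1)}, \ldots, \mathbf{h}_v^{(i)(k)}\right) \\
&= \mathrm{MLP}\left(\mathrm{CONCAT}\left(\mathbf{h}_v^{(i)(0)}, \mathbf{h}_v^{(i)(1)}, \ldots, \mathbf{h}_v^{(i)(k)}\right)\right)
\end{aligned} \tag{26.32}$$

Path-wise aggregation:

$$\mathbf{p}_v = \mathrm{COMB}_{\mathrm{att}}\left(\mathbf{l}_v^{(1)(K)}, \ldots, \mathbf{l}_v^{(|\mathcal{M}|)(K)}\right) \tag{26.33}$$

Final node representation:

$$\mathbf{h}_v^{(final)} = \mathbf{p}_v \tag{26.34}$$

Final graph representation:

$$\mathbf{h}_{\mathcal{G}_v} = \mathbf{h}_v^{(final)} \tag{26.35}$$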

where $k$ is the layer indicator and $\epsilon$ is a small number. Unlike GEM,
HACUD, and DeepHGNN, the objective of MatchGNet is matching. The final graph
representation is fed to a similarity learning framework with contrastive loss to
deal with unknown anomalies. During training, $P$ pairs of program graph snapshots
$(\mathcal{G}_{i(1)}, \mathcal{G}_{i(2)}),\ i \in \{1, 2, \ldots, P\}$ are collected
with corresponding ground-truth pairing information $y_i \in \{+1, -1\}$. If the
pair of graph snapshots belongs to the same program, the ground-truth label is
$y_i = +1$; otherwise, it is $y_i = -1$. For each pair of program snapshots, a
cosine score function is used to measure the similarity of the two program
embeddings, and the output is defined as follows:

$$\mathrm{Sim}(\mathcal{G}_{i(1)}, \mathcal{G}_{i(2)}) = \cos\left(\mathbf{h}_{\mathcal{G}_{i(1)}}, \mathbf{h}_{\mathcal{G}_{i(2)}}\right) = \frac{\mathbf{h}_{\mathcal{G}_{i(1)}} \cdot \mathbf{h}_{\mathcal{G}_{i(2)}}}{\|\mathbf{h}_{\mathcal{G}_{i(1)}}\| \cdot \|\mathbf{h}_{\mathcal{G}_{i(2)}}\|} \tag{26.36}$$

Correspondingly, the objective function can be formulated as:

$$\ell = \sum_{i=1}^{P} \left(\mathrm{Sim}(\mathcal{G}_{i(1)}, \mathcal{G}_{i(2)}) - y_i\right)^2 \tag{26.37}$$

26.5.5 Case Study 5: Anomaly Detection in Dynamic Graph Using 
Attention-based Temporal GCN 


Anomaly detection in dynamic graph using attention-based temporal GCN (AddGraph)
(Zheng et al, 2019) is the first work that applies GNNs to anomalous edge detection
in dynamic graphs. It focuses on modeling the dynamic graph via a GNN and performs
anomalous connection detection in telecom networks and social networks. The graphs
are constructed from edge stream data, and the constructed graphs are dynamic,
homogeneous, and plain.

The basic idea is to build a framework that describes normal edges by using all
available features in the graph snapshots during the training phase, including structural,
content, and temporal features. Then, at the prediction stage, a matching objective
similar to that of MatchGNet is used. In particular, AddGraph applies a GCN, denoted
$\mathrm{AGG}_{GCN}(\cdot)$, to compute the current state $c_v^t$ of a node $v$ by aggregating its neighborhood in the
current snapshot graph, which can be described as follows:


$$c_v^t = \mathrm{AGG}_{GCN}\big(h_v^{t-1}\big) \tag{26.38}$$


Since the state $c_v^t$ of a node is computed by aggregating the neighboring hidden
states from the previous timestamp $t-1$, the node hidden states in a short window
$\omega$ can be obtained and combined to get the short-term embedding $s_v^t$. In particular,
an attentional feature fusion is used to combine the node hidden states in the short
window, as follows:

$$s_v^t = \mathrm{COMB}_{att}\big(h_v^{t-\omega}, \ldots, h_v^{t-1}\big) \tag{26.39}$$


Then the short-term embedding $s_v^t$ and the current state $c_v^t$ are fed to a GRU, a classic
recurrent neural network, to compute the current hidden state that encodes the dynamics
within the graph. This stage can be described as follows:


$$h_v^t = \mathrm{GRU}\big(c_v^t, s_v^t\big) \tag{26.40}$$

The objective of AddGraph is matching. The hidden states of the nodes at each timestamp
are used to calculate the anomalous probabilities of an existing edge and a
negatively sampled edge, which are then fed to a margin loss.
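The two ingredients described above, attention-based fusion over a short window and a margin loss on existing versus negatively sampled edges, can be sketched as follows. This is a simplified illustration under stated assumptions: the dot-product attention with a hypothetical `query` vector stands in for AddGraph's attention module, the hinge form `max(0, margin + pos - neg)` is one common margin loss, and the anomaly scores are given directly rather than computed from node hidden states.

```python
import math


def attention_pool(window, query):
    """Softmax-weighted sum of the hidden states in a short window.

    window: list of hidden-state vectors, oldest first.
    query:  hypothetical attention vector (an assumption of this sketch).
    """
    scores = [sum(q * x for q, x in zip(query, h)) for h in window]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(window[0])
    return [sum(w * h[d] for w, h in zip(weights, window)) for d in range(dim)]


def margin_loss(pos_scores, neg_scores, margin=0.5):
    """Hinge-style margin loss over (existing edge, sampled negative edge) pairs.

    An existing edge should receive an anomaly score at least `margin`
    lower than the score of its negatively sampled counterpart.
    """
    return sum(max(0.0, margin + p - n) for p, n in zip(pos_scores, neg_scores))


# Two hidden states in the window; a zero query gives uniform attention.
short_term = attention_pool([[1.0, 0.0], [0.0, 1.0]], query=[0.0, 0.0])

# Hypothetical anomaly scores for two existing edges and their negatives.
loss = margin_loss(pos_scores=[0.1, 0.6], neg_scores=[0.9, 0.7], margin=0.5)
print(short_term, loss)
```

Only the second pair contributes to the loss here, because its existing edge is scored almost as anomalous as the sampled negative edge.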


26.5.6 Case Study 6: GCN-based Anti-Spam for Spam Review 
Detection 


GCN-based anti-spam (GAS) (Li et al, 2019a) applies GNNs to spam review
detection on the e-commerce platform Xianyu. Similar to previous works, the
constructed graph is static, heterogeneous, and attributed, such that $\mathcal{G} = (\mathcal{U}, \mathcal{I}, \mathcal{E})$.
There are two types of nodes: user nodes $\mathcal{U}$ and item nodes $\mathcal{I}$. The edges $\mathcal{E}$ are a
set of comments. Different from previous works, the edges $\mathcal{E}$ are the anomaly targets.
Moreover, as each edge represents a sentence, edge modeling is complicated,
and the number of edge types increases dramatically. To better capture the edge
representation, a message-passing-like GNN is used. The edge-wise aggregation
concatenates the previous representation of the edge itself, $h_e^{k-1}$, with the
corresponding user node representation $h_u^{k-1}$ and item node representation $h_i^{k-1}$. To get
the initial attributes of an edge, the word2vec embedding of each word in the
comment on the edge is extracted via an embedding function pre-trained on a
million-scale comment dataset. Then the word embeddings of the words in the
comment, $w_0, w_1, \ldots, w_n$, are fed to a TextCNN() function to get the comment embedding
$h_e^0$, which is used as the initial attributes of the edge. The edge-wise aggregation
is then defined as:

Edge-wise aggregation:


$$h_e^0 = \mathrm{TextCNN}(w_0, w_1, \ldots, w_n) \tag{26.41}$$


$$h_e^k = \mathrm{MAP}_{nonlinear}\big(\mathrm{CONCAT}(h_e^{k-1}, h_u^{k-1}, h_i^{k-1})\big) \tag{26.42}$$


On the other hand, the node-wise aggregation also needs to take the edges into con- 
sideration. The node-wise aggregation is performed by attention feature fusion of 
the target node and its connected edge followed by a non-linear mapping, which can 
be described with (a) user node-wise aggregation, and (b) item node-wise aggrega- 
tion as follows: 

Node-wise aggregation:


(a) User node-wise aggregation:


$$h_u^k = \mathrm{CONCAT}\Big(\mathrm{MAP}_{linear}(h_u^{k-1}),\ \mathrm{MAP}_{nonlinear}\big(\mathrm{COMB}_{att}(h_u^{k-1}, \mathrm{CONCAT}(h_e^{k-1}, h_i^{k-1}))\big)\Big) \tag{26.43}$$

(b) Item node-wise aggregation:


$$h_i^k = \mathrm{CONCAT}\Big(\mathrm{MAP}_{linear}(h_i^{k-1}),\ \mathrm{MAP}_{nonlinear}\big(\mathrm{COMB}_{att}(h_i^{k-1}, \mathrm{CONCAT}(h_e^{k-1}, h_u^{k-1}))\big)\Big) \tag{26.44}$$

where $k$ is the layer indicator. The final edge representation is computed by
concatenating the raw edge embedding $h_e^0$, the new edge embedding $h_e^K$, the corresponding
new user node embedding $h_u^K$, and the corresponding new item node embedding $h_i^K$ as
follows:

Final edge representation:


$$h_e^{final} = \mathrm{CONCAT}(h_e^0, h_e^K, h_u^K, h_i^K) \tag{26.45}$$

The objective of GAS is classification, and the final edge representation is fed to a
standard logistic loss function.
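To make the concatenation-based edge update of Eq. (26.42) and the final edge representation of Eq. (26.45) concrete, here is a small sketch in plain Python. It assumes that the non-linear map is a single linear layer followed by ReLU, that there is only one layer (K = 1), and that the user and item embeddings are left unchanged; the weights and vectors are made-up toy values, and the attention-based node updates of Eqs. (26.43) and (26.44) are omitted.

```python
def relu_linear(weights, x):
    """Non-linear map assumed here to be ReLU(W x); W is a list of rows."""
    return [max(0.0, sum(w * v for w, v in zip(row, x))) for row in weights]


def edge_update(h_edge, h_user, h_item, weights):
    """One edge-wise aggregation step: concatenate, then map non-linearly."""
    concatenated = h_edge + h_user + h_item  # list concatenation
    return relu_linear(weights, concatenated)


# Toy embeddings (hypothetical values): a 2-d comment embedding and
# 1-d user and item embeddings.
h_e0 = [1.0, -1.0]
h_u = [0.5]
h_i = [2.0]

# Hypothetical 2 x 4 weight matrix for the edge update.
W = [[1.0, 0.0, 0.0, 1.0],
     [0.0, 1.0, -1.0, 0.0]]

h_e1 = edge_update(h_e0, h_u, h_i, W)

# Final edge representation: raw edge, updated edge, user, and item embeddings.
h_final = h_e0 + h_e1 + h_u + h_i
print(h_e1, h_final)
```

Keeping the raw comment embedding in the final concatenation means the classifier still sees the original text signal alongside the graph-refined one.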


26.6 Future Directions 


GNN-based anomaly detection is an important research direction that leverages
multi-source, multi-view features extracted from both content and structure for
anomaly sample analysis and detection. It plays a key role in numerous high-impact
applications in areas such as cyber-security, finance, e-commerce, social networks,
industrial monitoring, and many other mission-critical tasks. Because of open issues
in data, models, and tasks, the field still requires considerable effort. Future
work lies mainly in two perspectives: anomaly analysis and machine learning.

From an anomaly analysis perspective, many research questions remain open.
How to define and identify the anomalies in the graph for different tasks? How to
effectively convert large-scale raw data to a graph? How to effectively leverage
the attributes? How to model the dynamics during graph construction? How to
preserve the heterogeneity during graph construction? At present, because of
data-specific and task-specific issues, the applications of GNN-based anomaly detection
are still limited. Many potential scenarios remain to which it could be applied.

From a machine learning perspective, many issues need to be addressed. How
to model the graph? How to represent the graph? How to leverage the context? How
to fuse the content and structure features? Which part of the structure to capture,
local or global? How to provide model explainability? How to protect the model
from adversarial attacks? How to overcome the time-space scalability bottleneck?
Recently, many contributions have been made from the machine learning perspective.
However, due to the unique characteristics of the anomaly detection problem,
which GNNs to use and how to apply them are still critical questions. Further
work will also benefit from new findings and new models in the graph machine
learning community.
Editor’s Notes: Graph neural networks for anomaly detection can be considered
a downstream task of graph representation learning, where the
long-term challenges in anomaly detection are coupled with the vulnerabilities
of graph neural networks, such as scalability discussed in Chapter 6
and robustness discussed in Chapter 8. Graph neural networks for anomaly
detection also benefit a wide range of downstream tasks in various
interesting, important, yet usually challenging areas such as anomaly detection
in dynamic networks, spam review detection for recommender systems,
and malware program detection, which are highly relevant to the topics
introduced in Chapters 15, 19, and 22.
Graph Neural Networks in Urban Intelligence


Yanhua Li, Xun Zhou, and Menghai Pan 


Abstract In recent years, smart and connected urban infrastructures have undergone 
a fast expansion, which increasingly generates huge amounts of urban big data, such 
as human mobility data, location-based transaction data, regional weather and air 
quality data, social connection data. These heterogeneous data sources convey rich 
information about the city and can be naturally linked with or modeled by graphs, 
e.g., urban social graph, transportation graph. These urban graph data can enable 
intelligent solutions to solve various urban challenges, such as urban facility plan- 
ning, air pollution, etc. However, it is also very challenging to manage, analyze, and 
make sense of such big urban graph data. Recently, there have been many studies 
on advancing and expanding Graph Neural Networks (GNNs) approaches for var- 
ious urban intelligence applications. In this chapter, we provide a comprehensive 
overview of the graph neural network (GNN) techniques that have been used to em- 
power urban intelligence, in four application categories, namely, (i) urban anomaly 
and event detection, (ii) urban configuration and transportation planning, (iii) urban
traffic prediction, and (iv) urban human behavior inference. The chapter also
discusses future directions of this line of research.
27.1 Graph Neural Networks for Urban Intelligence


27.1.1 Introduction 


According to the report (Desa, 2018) published by the United Nations in 2018, the
urban share of the world's population reached 55 percent in 2018 and is growing rapidly
over time. By 2050, the world will be one-third rural (34 percent) and two-thirds
urban (66 percent). Moreover, thanks to the fast development of sensing technologies
in recent years, various sensors are widely deployed in urban areas, e.g.,
GPS devices on vehicles, personal devices, air quality monitoring stations, and gas pressure
regulators. Stimulated by the large urban population and the wide use of
sensors, massive data are generated in the urban environment, for example,
the trajectory data of vehicles in ride-sharing services and air quality monitoring
data. Given a large amount of heterogeneous urban data, the question to answer is
what benefits we can draw from these data and how. For instance, can we use the GPS data
of vehicles to help urban planners better design the road network? Can we infer
the air quality index across the city based on a limited number of existing monitoring
stations? To answer these practical questions, the interdisciplinary research area of
Urban Intelligence has been extensively studied in recent years. In general, Urban
Intelligence, also referred to as urban computing, is a process of acquisition,
integration, and analysis of big and heterogeneous data generated by a diversity of
sources in urban spaces, such as sensors, devices, vehicles, buildings, and humans,
to tackle the major issues in cities (Zheng et al, 2014).

Data analytics (e.g., data mining, machine learning, optimization) techniques are 
usually employed to analyze numerous types of data generated in the urban scenar- 
ios for prediction, pattern discovery, and decision-making purposes. How to repre- 
sent urban data is an essential question for the design and implementation of these 
techniques. Given the heterogeneity of urban big data, various data structures can 
be used to represent them. For example, spatial data in an urban area can be rep- 
resented as raster data (like images), where the area is partitioned into grid cells 
(pixels) with attribute functions imposed on them (Pan et al, 2020b; Zhang et al, 
2019, 2020b,a; Pan et al, 2019, 2020a). Spatial data can also be represented as a 
collection of objects (e.g., vehicles, points of interest, and trajectory GPS points)
with their locations and topological relationships defined (Ding et al, 2020b). 

Moreover, the intrinsic structures of many urban big data allow them to be
represented as graphs. For instance, the structure of the urban road network helps
people model traffic data with graphs (Xie et al, 2019b; Dai et al, 2020; Cui
et al, 2019; Chen et al, 2019b; Song et al, 2020a; Zhang et al, 2020e; Zheng et al,
2020a; Diao et al, 2019; Guo et al, 2019b; Li et al, 2018e; Yu et al, 2018a; Zhang
et al, 2018e); the pipelines of a gas supply network enable people to model gas
pressure monitoring data with a graph (Yi and Park, 2020); people can also represent
the data on the map with a graph by dividing the city into functional regions (Wang
et al, 20190; Yi and Park, 2020; Geng et al, 2019; Bai et al, 2019a; Xie et al, 2016).
Representing urban data with graphs can capture the intrinsic topological information
and knowledge in the data, and plenty of techniques have been developed to analyze
urban graph data.

Graph Neural Networks (GNNs) are naturally employed to solve various real-world
problems with urban graph data. For example, Convolutional Graph Neural
Networks (ConvGNN) (Kipf and Welling, 2017b) are used to capture the spatial
dependencies of urban graph data, and Recurrent Graph Neural Networks
(RecGNN) (Li et al, 2016b) capture the temporal dependencies. Spatial-temporal
Graph Neural Networks (STGNN) (Yu et al, 2018a) can capture both spatial and
temporal dependencies in the data and are widely used in many urban
intelligence problems, e.g., predicting traffic status based on urban traffic data
(Zhang et al, 2018e; Li et al, 2018e; Yu et al, 2018a). The traffic data are modeled
as spatial-temporal graphs where the nodes are sensors on road segments, and each
node has the average traffic speed within a window as dynamic input features.

In the following sections, we first summarize the general application scenarios in 
urban intelligence, followed by the graph representations in urban scenarios. Then, 
we provide more details on GNN for urban configuration and transportation plan- 
ning, urban anomaly and event detection, and urban human behavior inference, re- 
spectively. 


27.1.2 Application scenarios in urban intelligence 


The diverse application domains in urban intelligence include urban planning,
transportation, environment, energy, human behavior analysis, economy, and event
detection. In the following paragraphs, we introduce the practical problems
and the common datasets in these domains. The problems and examples highlighted
below are not exhaustive; we introduce only some critical problems and typical
examples from the literature, which are summarized in Table 27.1.

1) Urban configuration. Urban configuration is essential for enabling smart cities.
It deals with the design of the entire urban area, such as land use, the
layout of human settlements, and the design of road networks. The problems in this
domain include estimating the impact of a construction (Zhang et al, 2019c),
discovering the functional regions of the city (Yuan et al, 2012), and detecting city
boundaries (Ratti et al, 2010). In (Zhang et al, 2019c), the authors analyze
historical taxi GPS data and road network data, define the
off-deployment traffic estimation problem as a traffic generation problem, and
develop a novel deep generative model, TrafficGAN, that captures the shared patterns
across spatial regions of how traffic conditions evolve according to travel demand
changes and underlying road network structures. This problem is important for city
planners who evaluate and develop urban deployment plans. In (Yuan et al, 2012), the
authors propose the DRoF framework, which Discovers Regions of different Functions
in a city using human mobility between regions, with data collected from the GPS
devices in taxis in Beijing and points of interest (POIs) located in the city. The
understanding of functional regions in a city can calibrate urban planning and facilitate
other applications, such as choosing a location for a business. In (Ratti et al, 2010),
the authors propose a model to detect a city’s boundary by analyzing the human
network inferred from a large telecommunications database in Great Britain.
Answering this question can help city planners understand the exact extent of
the urban area, which changes quickly over time.

Table 27.1: Application domain and examples in urban intelligence.

| Application domain  | Example task                                         | Example data source                                         |
|---------------------|------------------------------------------------------|-------------------------------------------------------------|
| Urban configuration | Estimate impact of construction (Zhang et al, 2019c) | Taxi GPS, road network.                                     |
| Urban configuration | Discover functional regions (Yuan et al, 2012)       | Taxi GPS, POIs.                                             |
| Transportation      | Improve efficiency of taxi drivers (Pan et al, 2019) | Taxi GPS, road network.                                     |
| Environment         | Infer air quality (Zheng et al, 2013)                | Air quality data from monitor stations, road network, POIs. |
| Energy consumption  | Estimate gas consumption (Shang et al, 2014)         | Taxi GPS.                                                   |
| Human behavior      | Estimate user similarity (Li et al, 2008)            | GPS data from phones.                                       |
| Economy             | Place retail store (Karamshuk et al, 2013)           | POIs, human mobility data.                                  |
| Public Safety       | Detect anomalous traffic pattern (Pang et al, 2011)  | Taxi GPS, road network.                                     |

2) Transportation. Transportation plays an important role in the urban area. Urban
intelligence deals with several transportation problems in the city, e.g.,
routing for drivers, estimating travel time, and improving the efficiency of the taxi
system and the public transit system. In (Yuan et al, 2010), the authors propose the
T-Drive system, which provides personalized driving directions that adapt to weather,
traffic conditions, and a person’s own driving habits. The system is built on
historical trajectory data of taxicabs. In (Pan et al, 2019), the authors propose a
solution framework to analyze the learning curve of taxi drivers. The proposed method
first learns the drivers’ preferences for different profile and habit features in each
time period, then analyzes the preference dynamics of different groups of drivers.
The results illustrate that taxi drivers tend to change their preferences for some habit
features to improve their operation efficiency. This finding can help new drivers
improve their operation efficiency faster. The authors in (Watkins et al, 2011)
conducted a study on the impact of providing real-time bus arrival information directly
on riders’ mobile phones and found that it reduces not only the perceived wait time of
those already at a bus stop, but also the actual wait time experienced by customers
who plan their journey using such information.

3) Urban Environment. Urban intelligence can deal with the potential threats to the
environment caused by the fast pace of urbanization. Environmental factors such as
air quality and noise are essential for people’s health. In (Zheng et al, 2013), the
authors infer real-time and fine-grained air quality information throughout a city
based on the (historical and real-time) air quality data reported by existing monitor
stations and a variety of data sources observed in the city, such as meteorology, traffic
flow, human mobility, structure of road networks, and POIs. The results can be
used to advise people when and where to conduct outdoor activities, e.g., jogging.
The results can also suggest suitable locations for deploying new air quality monitoring
stations. Noise pollution is usually serious in urban areas and affects both
the mental and physical health of human beings. Santini et al (2008) assess environmental
noise pollution in urban areas by using monitoring data from wireless
sensor networks.

4) Energy supply and consumption. Another application domain of urban intelligence
is energy consumption in the urban area, which usually deals with the problems
of sensing city-scale energy cost, improving energy infrastructures, and ultimately
reducing energy consumption. Common energy sources include gas and electricity. Shang
et al (2014) inferred the gas consumption and pollution emission of vehicles traveling
on a city’s road network in the current time slot using GPS trajectories from a
sample of vehicles (e.g., taxicabs). The knowledge can be used not only to suggest
cost-efficient driving routes but also to identify the road segments where gas has
been wasted significantly. Momtazpour et al (2012) propose a framework to predict
electric vehicle (EV) charging needs based on owners’ activities, EV charging
demands at different locations in the city, and the available charge of EV batteries,
and design distributed mechanisms that manage the movements of EVs to different
charging stations.

5) Urban human behavior analysis. With the popularization of smart devices,
people generate massive location-embedded information every day, such as
location-tagged text, images, videos, check-ins, and GPS trajectories. The first question in
this domain is estimating user similarity, so that similar users can be recommended as
friends. Li et al (2008) connect users with similar interests, even when they may not
have known each other previously, and support community discovery by employing the
GPS trajectories collected from GPS-equipped devices such as phones.

6) Economy. Urban intelligence can benefit the urban economy. Human mobility
and the statistics of POIs can reflect the economy of the city. For example, the
average price of a dinner in restaurants can indicate the income level and the
consumption power. In (Karamshuk et al, 2013), the authors study the problem
of optimal retail store placement in the context of location-based social networks.
They collected human mobility data from Foursquare and analyzed it to understand
how the popularity of three retail store chains in New York is shaped in terms of
the number of check-ins. The results indicate that some POIs, such as train stations and
airports, can imply the popularity of a location, and that the number of competing stores
is also an indicator of popularity.

7) Public safety. Public safety and security in the urban area always attract
people’s concern. The availability of different data enables us to learn from history
how to deal with public safety problems, e.g., traffic accidents (Yuan et al,
2018), large events (Vahedian et al, 2019; Khezerlou et al, 2021, 2017; Vahedian
et al, 2017), and pandemics (Bao et al, 2020), and we can use the data to detect
and predict abnormal events. Pang et al (2011) detect anomalous traffic patterns
from the spatial-temporal data of vehicles. The authors partition a city into uniform
grids and count the number of vehicles arriving in a grid over a time period. The
objective is to identify contiguous sets of cells and time intervals that have the
largest statistically significant departure from expected behavior (i.e., the number of
vehicles).
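The grid-counting step described for anomalous traffic pattern detection can be illustrated with a short sketch. It bins hypothetical vehicle positions into uniform cells and flags cells whose count far exceeds an expected value. The fixed ratio threshold is a deliberately simple stand-in for a proper statistical test, and the function names, coordinates, and expected counts are all assumptions made for illustration.

```python
from collections import Counter


def grid_counts(points, origin, cell_size):
    """Count points per uniform grid cell.

    points:    iterable of (x, y) positions in the same units as cell_size.
    origin:    (x, y) of the lower-left corner of the grid.
    Returns a Counter keyed by (column, row) cell indices.
    """
    ox, oy = origin
    counts = Counter()
    for x, y in points:
        cell = (int((x - ox) // cell_size), int((y - oy) // cell_size))
        counts[cell] += 1
    return counts


def flag_cells(counts, expected, ratio=2.0):
    """Flag cells whose observed count exceeds `ratio` times the expected count.

    Cells without an expected value default to an expectation of 1.0
    (an arbitrary choice for this sketch).
    """
    return sorted(c for c, n in counts.items() if n > ratio * expected.get(c, 1.0))


# Hypothetical vehicle arrivals within one time period.
arrivals = [(0.5, 0.5), (0.7, 0.2), (1.5, 0.5), (0.1, 0.9)]
counts = grid_counts(arrivals, origin=(0.0, 0.0), cell_size=1.0)

# Hypothetical historical averages per cell.
expected = {(0, 0): 1.0, (1, 0): 1.0}
flagged = flag_cells(counts, expected, ratio=2.0)
print(dict(counts), flagged)
```

A fuller treatment would search over contiguous sets of cells and time intervals rather than scoring each cell independently.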


27.1.3 Representing urban systems as graphs 


Various data structures and models can be employed to define the spatial settings 
of urban systems. For example, a simple model is a grid structure, where the ur- 
ban area is partitioned into grid cells, with a set of attribute values of interest (e.g., 
average traffic speed, number of taxis, population, rainfall) associated with each 
cell. While such a model is simple to implement, it ignores many intrinsic and im- 
portant relationships existing in urban data. For example, a grid structure may lose 
the information of road connectivity in the underlying traffic system of the city. In 
many scenarios, instead, a graph is an elegant choice to capture the intrinsic topological
information and knowledge in the data. Many urban system components can
be represented as graphs. Additional attributes may be associated with nodes and/or 
edges. In this section, we introduce graph representations of various urban system 
scenarios, which are summarized in Table 27.2. The application domains covered 
include 1) Urban transportation and configuration planning, 2) Urban environment 
monitoring, 3) Urban energy supply and consumption, 4) Urban event and anomaly 
detection, and 5) Urban human behavior analysis. 

1) Urban transportation and configuration planning. Modeling the urban
transportation system as a graph is widely used in solving real-world urban intelligence
problems, e.g., traffic flow prediction (Xie et al, 2019b; Dai et al, 2020; Cui et al,
2019; Chen et al, 2019b; Song et al, 2020a; Zhang et al, 2020e; Zheng et al, 2020a;
Diao et al, 2019; Guo et al, 2019b; Li et al, 2018e; Yu et al, 2018a; Zhang et al,
2018e), the parking availability problem (Zhang et al, 2020h), etc. The graphs are
usually built on the real-world road network. To solve the problem of traffic flow
prediction, in (Cui et al, 2019), the authors employ an undirected graph to predict
the traffic state; the nodes are the traffic sensing locations, e.g., sensor stations and road
segments, and the edges are the intersections or road segments connecting those
traffic sensing locations. Xie et al (2019b); Dai et al (2020) model the urban traffic
network as a directed graph with attributes to predict the traffic speed; the nodes
are the road segments, and the edges are the intersections. Road segment width,
length, and direction are the attributes of the nodes, and the type of intersection
and the presence of traffic lights or toll gates are the attributes of the edges. For
urban configuration, Wu et al (2020c) incorporate a hierarchical GNN framework
to learn road network representations at different levels. The nodes in the hierarchical
graph include road segments, structural regions, and functional zones, and
the edges are intersections and hyperedges. There are also works on predicting
parking availability. Zhang et al (2020h) model the parking lots and the surrounding
POIs and population features as a graph to predict the parking availability of
the parking lots. The nodes are the parking lots, and an edge connects
two parking lots whose on-road distance is smaller
than a threshold. Context features, e.g., POI distribution and population, are the
attributes of the nodes.

Table 27.2: Graph representations in urban systems

| Application domain                      | Nodes                        | Edges            | Examples |
|-----------------------------------------|------------------------------|------------------|----------|
| Transportation & configuration planning | Road segments                | Intersections    | Traffic flow prediction (Xie et al, 2019b), (Dai et al, 2020), (Cui et al, 2019), (Chen et al, 2019b), (Song et al, 2020a), (Zhang et al, 2020e), (Zheng et al, 2020a), (Diao et al, 2019), (Guo et al, 2019b), (Li et al, 2018e), (Yu et al, 2018a), (Zhang et al, 2018e) |
| Transportation & configuration planning | Functional zones             | Road connections | Learning road network representation (Wu et al, 2020c) |
| Transportation & configuration planning | POIs                         | Road connections | Parking availability prediction, POI recommendation (Zhang et al, 2020h), (Chang et al, 2020a) |
| Environment monitoring                  | Monitoring sensors           | Proximity        | Air quality inference (Wang et al, 2020h), (Li et al, 2017f) |
| Energy supply & consumption             | Regulators                   | Pipelines        | Gas pressure monitoring (Yi and Park, 2020) |
| Event & anomaly detection               | Urban regions                | Proximity        | Traffic accident prediction (Zhou et al, 2020g), (Zhou et al, 2020h), (Yu et al, 2021b) |
| Human behavior analysis                 | Sessions, locations, objects | Interactions     | User behavior modeling (Wang et al, 2020a) |
| Human behavior analysis                 | Urban regions                | Proximity        | Passenger demand prediction (Wang et al, 20190), (Yi and Park, 2020), (Geng et al, 2019), (Bai et al, 2019a), (Xie et al, 2016) |

2) Urban environment monitoring system. The air quality monitoring
system can be modeled as a graph to forecast the air quality in the urban area (Wang et al,
2020h; Li et al, 2017f). For example, Wang et al (2020h) proposed PM2.5-GNN
to forecast the PM2.5 index at different locations. The nodes are locations determined
by latitude, longitude, and altitude, and there exists an edge between two nodes
if the distance and the altitude difference between them are below their respective
thresholds (e.g., distance < 300 km and altitude difference < 1200 m). The node
attributes include Planetary Boundary Layer (PBL) height, K index, wind speed, 2m
temperature, relative humidity, precipitation, and surface pressure. Edge attributes
include wind speed of the source node, distance between source and sink, wind direction
of the source node, and direction from source to sink.
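The threshold rule for connecting monitoring locations can be sketched in a few lines of Python. The example below uses the great-circle (haversine) distance, which is an assumption of this sketch since the text does not say how distance is measured, and it returns undirected node pairs without the edge attributes listed above. The station names and coordinates are hypothetical.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    radius_km = 6371.0  # mean Earth radius
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))


def build_edges(nodes, max_km=300.0, max_alt_diff_m=1200.0):
    """Connect two locations if both distance and altitude gap are below thresholds.

    nodes: dict mapping a name to (latitude, longitude, altitude in metres).
    """
    names = sorted(nodes)
    edges = []
    for idx, a in enumerate(names):
        for b in names[idx + 1:]:
            lat1, lon1, alt1 = nodes[a]
            lat2, lon2, alt2 = nodes[b]
            close = haversine_km(lat1, lon1, lat2, lon2) < max_km
            level = abs(alt1 - alt2) < max_alt_diff_m
            if close and level:
                edges.append((a, b))
    return edges


# Hypothetical stations: A and B are near each other at similar altitude,
# C is far away, and D is near A but much higher.
stations = {
    "A": (39.9, 116.4, 50.0),
    "B": (39.1, 117.2, 5.0),
    "C": (34.3, 108.9, 400.0),
    "D": (40.0, 116.5, 1500.0),
}
edges = build_edges(stations)
print(edges)
```

The altitude condition matters in practice because a mountain ridge between two nearby stations can block the transport of pollutants even when the horizontal distance is small.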

3) Urban energy supply and consumption. GNNs are also employed in analyzing
urban energy supply and consumption systems. For example, Yi and Park (2020)
proposed a framework to predict the gas pressure in a gas supply network. The
gas regulators are the nodes, and the pipelines that connect pairs of
regulators are the edges.

4) Urban event and anomaly detection. Urban event and anomaly detection is a
hot topic in urban intelligence. Machine learning models are employed to detect or
predict events occurring in the urban area, e.g., traffic accident prediction (Zhou
et al, 2020g,h; Yu et al, 2021b). In (Zhou et al, 2020g), the authors proposed a
framework to predict traffic accidents in different regions of the city. The urban area
is divided into subregions, i.e., grids, and two subregions are connected by an edge
if the traffic elements within them are strongly correlated.

5) Urban human behavior analysis. Studying human behavior in urban regions
can benefit people in many aspects, for example, demographic attribute prediction,
personalized recommendation, and passenger demand prediction. Some works
apply GNNs to human behavior modeling, which is essential
for many real-world applications such as demographic attribute prediction,
content recommendation, and targeted advertising. In (Wang et al, 2020a), the authors
model human behavior via a tripartite graph. The nodes include users’ sessions,
locations, and items. There exists an edge between a session node and a location node if
the user started the session at this location. Similarly, there exists an edge between
a session node and an item node if the user interacted with this item within the
session. Each edge possesses a time attribute indicating the temporal signal of the
interaction between the two nodes. Another application of analyzing human behavior is
passenger demand prediction. Understanding human behavior in daily transit can
help improve the efficiency of the urban transportation system. For example, predicting
the passenger demand in a ride-sharing system can help the ride-sharing company
and the drivers improve their operation efficiency. In recent publications, many
researchers employ graph neural networks to predict human
mobility (Wang et al, 20190; Yi and Park, 2020; Geng et al, 2019; Bai et al, 2019a;
Xie et al, 2016); the nodes of the graph are usually subregions of the city, and
the edges are usually defined based on spatial proximity.
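The tripartite session-location-item graph described above can be represented with plain Python data structures, as in the following sketch. The session records, field names, and timestamps are hypothetical, and the graph is stored as a simple timestamped edge list rather than in any particular GNN library format.

```python
from collections import defaultdict


def build_tripartite_graph(sessions):
    """Build timestamped edges between sessions, locations, and items.

    Each session contributes one session-location edge (stamped with the
    session start time) and one session-item edge per interaction.
    """
    edges = []
    for s in sessions:
        session_node = ("session", s["id"])
        edges.append((session_node, ("location", s["location"]), s["start_time"]))
        for item, timestamp in s["items"]:
            edges.append((session_node, ("item", item), timestamp))
    return edges


def neighbors(edges):
    """Undirected adjacency sets derived from the edge list."""
    adjacency = defaultdict(set)
    for u, v, _ in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    return adjacency


# Hypothetical sessions: both start at the same location and share one item.
sessions = [
    {"id": "s1", "location": "mall", "start_time": 100,
     "items": [("shoes", 105), ("bag", 130)]},
    {"id": "s2", "location": "mall", "start_time": 200,
     "items": [("shoes", 210)]},
]
edges = build_tripartite_graph(sessions)
adjacency = neighbors(edges)
print(len(edges), sorted(adjacency[("item", "shoes")]))
```

Sessions never connect to each other directly; they are related only through shared locations or items, which is what lets a GNN propagate information between users with similar behavior.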


27.1.4 Case Study 1: Graph Neural Networks in urban
configuration and transportation


Urban intelligence can help urban planners design urban configuration and benefit
the urban transportation system from different perspectives, e.g., operation efficiency,
safety, and environmental protection. To enable urban intelligence in urban
configuration and transportation planning, researchers have developed practical machine
learning approaches, including graph neural networks (GNN), to deal with real-world
problems. In this section, we introduce some state-of-the-art (SOTA) GNN designs
that target real-world urban configuration and transportation
problems.

Figure 27.1: CNN-based STGNN

Urban traffic prediction. Predicting traffic status, e.g., speed, volume, is im- 
portant in enabling urban intelligence. The traffic prediction problem is a typical 
time-series prediction problem: 


Definition 27.1. Urban traffic prediction problem. Given historical traffic observations
and context features of the road network, predict the traffic status (e.g.,
speed, flow, etc.) in future time slots over the road network.


To address the traffic prediction problem, Spatial-temporal Graph Neural Networks
(STGNN) are usually employed. The road segments are the nodes, and the traffic
status forms the attributes of the nodes. The traffic status in different time slots
corresponds to the temporal dynamics of the graph. Usually, a graph convolution
operation is used to capture the spatial dependencies among the nodes, and
a 1D-convolution operation is then employed to capture the temporal dependencies
among different time slots. The framework of the CNN-based STGNN is illustrated in
Fig. 27.1. The spatial-temporal embeddings can be used to predict the traffic status.
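The two-stage idea of a CNN-based STGNN, a graph convolution across nodes followed by a 1D convolution along time, can be sketched as below. This is a deliberately reduced illustration with one scalar feature per node: the graph convolution is taken to be a row-normalized average over each node and its neighbors, and the temporal filter is a fixed averaging kernel. Both are assumptions of this sketch, not the learned layers of any specific published model.

```python
def normalize_adjacency(adj):
    """Add self-loops and row-normalize: D^-1 (A + I)."""
    n = len(adj)
    with_loops = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
                  for i in range(n)]
    return [[v / sum(row) for v in row] for row in with_loops]


def graph_conv(a_norm, x):
    """Spatial step: each node averages its own value and its neighbors' values."""
    return [sum(a * v for a, v in zip(row, x)) for row in a_norm]


def temporal_conv(series, kernel):
    """Temporal step: valid 1D convolution (no padding) along time for one node."""
    k = len(kernel)
    return [sum(kernel[j] * series[t + j] for j in range(k))
            for t in range(len(series) - k + 1)]


def st_block(adj, x_seq, kernel):
    """Apply the spatial step at every time slot, then the temporal step per node.

    x_seq: list over time slots of per-node values, shape [T][N].
    Returns per-node output sequences, shape [N][T - len(kernel) + 1].
    """
    a_norm = normalize_adjacency(adj)
    spatial = [graph_conv(a_norm, x_t) for x_t in x_seq]
    return [temporal_conv([s[i] for s in spatial], kernel)
            for i in range(len(adj))]


# Hypothetical road graph: three segments in a line (0 - 1 - 2).
adj = [[0.0, 1.0, 0.0],
       [1.0, 0.0, 1.0],
       [0.0, 1.0, 0.0]]
# Hypothetical average speeds for three time slots.
speeds = [[10.0, 20.0, 30.0],
          [20.0, 30.0, 40.0],
          [30.0, 40.0, 50.0]]
embeddings = st_block(adj, speeds, kernel=[0.5, 0.5])
print(embeddings)
```

In a trained model both the graph filter and the temporal kernel carry learnable weights, and several such blocks are stacked before a prediction layer.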

Another design of STGNN is based on Recurrent Neural Networks (RNNs), which can
also predict traffic status in spatial-temporal graphs. Most RNN-based approaches
capture spatial-temporal dependencies by using graph convolution operations to
filter the inputs and hidden states passed to a recurrent unit. The basic RNN can
be formulated as in Eq. (27.1).


$$H^{(t)} = \sigma\left(W X^{(t)} + U H^{(t-1)} + b\right), \qquad (27.1)$$


where $X^{(t)}$ is the node feature matrix at time step $t$, $H^{(t)}$ is the
hidden state, and $W$, $U$, and $b$ are the network parameters. Then, the
RNN-based STGNN can be formulated as Eq. (27.2):


$$H^{(t)} = \sigma\left(\mathrm{Gconv}(X^{(t)}, A; W) + \mathrm{Gconv}(H^{(t-1)}, A; U) + b\right), \qquad (27.2)$$


where $\mathrm{Gconv}(\cdot)$ is the graph convolution operation, and $A$ is the
graph adjacency matrix. Both designs of STGNN can be employed to predict the node
attributes, i.e., traffic status, given the spatial-temporal graph of traffic.

Figure 27.2: Hierarchical road network graph
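
To make the recurrence in Eq. (27.2) concrete, here is a minimal dependency-free
Python sketch of one possible reading. It assumes the simplest graph convolution,
Gconv(X, A; W) = A X W, uses tanh for the activation, and starts from a zero
hidden state; these choices and the function names are illustrative assumptions,
not the formulation of any specific published model.

```python
import math


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def gconv(x, adj, weight):
    """Assumed simplest graph convolution: Gconv(X, A; W) = A X W."""
    return matmul(matmul(adj, x), weight)


def graph_rnn_step(x_t, h_prev, adj, w, u, b):
    """One step of Eq. (27.2), with tanh standing in for the activation."""
    gx = gconv(x_t, adj, w)      # filtered input, shape (nodes, hidden)
    gh = gconv(h_prev, adj, u)   # filtered previous hidden state
    return [
        [math.tanh(gx[i][j] + gh[i][j] + b[j]) for j in range(len(b))]
        for i in range(len(gx))
    ]


def run_graph_rnn(x_seq, adj, w, u, b):
    """Unroll the recurrence over all time steps, starting from H = 0."""
    h = [[0.0] * len(b) for _ in range(len(adj))]
    for x_t in x_seq:
        h = graph_rnn_step(x_t, h, adj, w, u, b)
    return h


# Toy example: 2 connected road segments, 1 input feature, hidden size 2.
adj = [[0.5, 0.5], [0.5, 0.5]]          # pre-normalized, with self-loops
w = [[0.5, -0.5]]                       # (features x hidden)
u = [[0.1, 0.0], [0.0, 0.1]]            # (hidden x hidden)
b = [0.0, 0.0]
x_seq = [[[0.2], [0.4]], [[0.3], [0.5]], [[0.1], [0.6]]]
final_hidden = run_graph_rnn(x_seq, adj, w, u, b)
```

The final hidden state holds one embedding per node, which a separate output
layer (not shown) could map to the predicted traffic status.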

Urban configuration. An urban road network is a vital component of urban
configuration. How to represent it is essential for many analyses and much
research related to real-world applications. Because a real-world road network is
a complex system with hierarchical structures, long-range dependencies among
units, and functional roles, it is challenging to design effective representation
learning methods. The road network representation learning problem can be defined
as follows:


Definition 27.2. Road network representation learning problem. Given a road 
network, the target is to construct the corresponding graphs that can represent the 
structure and topological information of the road network. 


Thanks to the flexibility of graph topology, we can represent a road network with
hierarchical graphs. In (Wu et al, 2020c), the authors propose to represent urban
road networks with a three-level hierarchical graph, in which the nodes at the
three levels correspond to road segments, structural regions, and functional
zones, respectively, as illustrated in Fig. 27.2. A structural region is an
aggregation of connected road segments that serves a specific traffic role, e.g.,
an intersection or an overpass. A functional zone is an aggregation of structural
regions that represents a functional facility in the city, e.g., a transportation
hub or a shopping area. To learn the hierarchical graph representation, the road
segments are first represented by contextual embeddings built from features such
as road type, lane number, and segment length. Then, graph clustering and network
reconstruction techniques are employed to form the structural region graph.
Finally, vehicle trajectory data are employed to capture the functional zones over
structural regions.
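
The step from road segments to structural regions can be pictured as graph
coarsening. The Python sketch below assumes that a hard assignment of segments to
regions is already available (for instance from some clustering step), averages
the segment embeddings within each region, and links two regions whenever any of
their segments are connected. The hard assignment, the mean pooling, and all names
are simplifying assumptions for illustration and are not the procedure of
(Wu et al, 2020c).

```python
from collections import defaultdict


def coarsen_road_graph(segment_emb, segment_edges, assignment):
    """Pool segment embeddings into region embeddings and derive region edges.

    segment_emb:   dict segment_id -> list of floats (contextual embedding)
    segment_edges: iterable of (segment_id, segment_id) pairs
    assignment:    dict segment_id -> region_id (assumed to be given)
    """
    members = defaultdict(list)
    for segment, region in assignment.items():
        members[region].append(segment)

    # Region embedding = mean of its member segment embeddings.
    region_emb = {}
    for region, segments in members.items():
        columns = zip(*(segment_emb[s] for s in segments))
        region_emb[region] = [sum(col) / len(segments) for col in columns]

    # Two regions are linked if any pair of their segments is linked.
    region_edges = set()
    for a, b in segment_edges:
        ra, rb = assignment[a], assignment[b]
        if ra != rb:
            region_edges.add(tuple(sorted((ra, rb))))
    return region_emb, region_edges


# Hypothetical features per segment: [lane number, segment length in km].
segment_emb = {"s1": [2.0, 0.4], "s2": [4.0, 0.6], "s3": [3.0, 1.0]}
segment_edges = [("s1", "s2"), ("s2", "s3")]
assignment = {"s1": "intersection_A", "s2": "intersection_A", "s3": "overpass_B"}
region_emb, region_edges = coarsen_road_graph(segment_emb, segment_edges, assignment)
```

Applying the same function again, with regions in place of segments, would give a
comparable sketch of the step from structural regions to functional zones.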

Figure 27.3: Example GNN framework for traffic accident prediction.


27.1.5 Case Study 2: Graph Neural Networks in urban anomaly 
and event detection 


Public safety and security in urban areas are a constant public concern. The
availability of diverse data enables us to learn from history how to deal with
public safety problems, e.g., traffic accidents, crime, large events, and
pandemics, and to use the data to detect and predict abnormal events.

Traffic accident prediction. Traffic accident prediction is of great significance
for improving the safety of the road network. Although “accident” is a word
associated with “randomness”, there exists a significant correlation between the
occurrence of traffic accidents and the surrounding environmental features, e.g.,
traffic flow, road network structure, and weather. Thus, machine learning
approaches such as GNNs can be employed to predict or forecast traffic accidents
over the city, which can help enable urban intelligence.

The problem of traffic accident prediction is as follows: 


Definition 27.3. Traffic accident prediction problem. Given the road network data 
and the historical environmental features, the target is to predict the traffic accident 
risk over the city in the future. 


The environmental features include the traffic conditions, surrounding POIs, etc.
In recent publications (Zhou et al, 2020g,h; Yu et al, 2021b), GNNs are employed
to solve this problem.

The graphs used for traffic accident prediction are usually constructed by
dividing the urban area into grids, with each grid considered as a node. If the
traffic conditions of two nodes are strongly correlated, there is an edge between
them. The contextual environmental features are the attributes of each grid. After
the graphs are constructed for different historical time slots, graph
convolutional neural networks (GCNs) are usually used to extract the hidden
embeddings in each time slot. Then, methods for time-series inputs, e.g.,
RNN-based neural networks, can be employed to capture the temporal dependencies.
Finally, the spatial-temporal information is used to predict the traffic accident
risk over the city. Overall, the solution framework can be considered an STGNN, as
illustrated in Fig. 27.3. For more details, please refer to (Zhou et al, 2020g,h;
Yu et al, 2021b).

Figure 27.4: Example STGNN framework for passenger demand prediction.
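
The grid-graph construction described above can be sketched in a few lines of
Python. The sketch assumes that "strong correlation" means the absolute Pearson
correlation of two grids' traffic series reaches a threshold; the threshold value
of 0.8, the input layout, and the function names are hypothetical choices for
illustration only.

```python
import math


def pearson(a, b):
    """Pearson correlation of two equal-length series (0.0 if one is constant)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def build_grid_graph(traffic_by_grid, threshold=0.8):
    """Nodes are grid cells; an edge links two cells whose traffic series
    have absolute correlation >= threshold (assumed criterion)."""
    grid_ids = sorted(traffic_by_grid)
    edges = set()
    for idx, g in enumerate(grid_ids):
        for h in grid_ids[idx + 1:]:
            if abs(pearson(traffic_by_grid[g], traffic_by_grid[h])) >= threshold:
                edges.add((g, h))
    return grid_ids, edges


# Toy example: traffic volume per time slot for three grid cells (row, column).
traffic = {
    (0, 0): [10, 20, 30, 40],
    (0, 1): [12, 22, 29, 41],
    (1, 0): [40, 10, 35, 5],
}
nodes, edges = build_grid_graph(traffic)
```

In this toy input only the first two cells move together, so they are the only
pair joined by an edge; the per-grid context features would then be attached to
each node before a GCN is applied.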


27.1.6 Case Study 3: Graph Neural Networks in urban human 
behavior inference 


Human behavior analysis plays an important role in enabling urban intelligence.
For example, studying the behavior of drivers can help improve the efficiency of
the urban transportation system, analyzing passenger behavior can help improve the
operational efficiency of drivers in taxi or ride-hailing services, and
understanding user behavior patterns can help improve personalized recommendation
of commercial items, which will benefit the urban economy. In this section, we
demonstrate how GNNs work in analyzing urban human behavior via two real-world
applications, i.e., passenger demand prediction and user behavior modeling.

Passenger demand prediction. Passenger demand prediction is mostly conducted at
the region level, i.e., the urban area is divided into small grids. The problem
can be defined as follows:


Definition 27.4. Passenger demand prediction problem. Given the historical
demands and context feature distributions, the task is to predict the passenger
demand in each region.


Unlike most traffic graphs, which use road segments as nodes, graphs for the
passenger demand prediction problem are usually constructed with grids as the
nodes. The edges, i.e., the correlations between each pair of nodes, are
determined by spatial proximity, similarity of contextual environment, or road
network connectivity for distant grids.

Spatial-temporal Graph Neural Networks (STGNNs) are the most popular GNN models
employed in predicting passenger demand. In (Geng et al, 2019), the authors
propose the spatiotemporal multi-graph convolution network (ST-MGCN) to predict
passenger demand in ride-hailing services. The overall framework is illustrated in
Fig. 27.4. First, multiple graphs are constructed based on different aspects of
the relationship between each pair of grids, i.e., proximity, functional
similarity, and transportation connectivity. Then, an RNN is used to aggregate
observations at different times while considering the global contextual
information. After that, a GCN is used to model the non-Euclidean correlations
among regions. Finally, the aggregated embeddings are used to predict the
passenger demand over the city.
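
The multi-graph idea can be illustrated with a small Python sketch in which one
convolution is computed per relation graph and the results are summed. This is
only a simplified reading: the plain A X W convolution, the summation used for
fusion, the toy adjacency matrices, and the function names are assumptions, and
the sketch omits the RNN-based temporal aggregation of ST-MGCN entirely.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def multi_graph_conv(x, adjacencies, weights):
    """Fuse several relation graphs by summing their convolutions:
    sum over k of A_k X W_k (assumed fusion rule)."""
    n_nodes, out_dim = len(x), len(weights[0][0])
    total = [[0.0] * out_dim for _ in range(n_nodes)]
    for adj, weight in zip(adjacencies, weights):
        part = matmul(matmul(adj, x), weight)
        for i in range(n_nodes):
            for j in range(out_dim):
                total[i][j] += part[i][j]
    return total


# Toy example: 3 grids with one demand feature each and three relation graphs.
demand = [[4.0], [2.0], [6.0]]
proximity = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
functional_similarity = [[0, 0, 1], [0, 0, 0], [1, 0, 0]]
connectivity = [[0, 0, 0], [0, 0, 1], [0, 1, 0]]
weights = [[[0.5]], [[0.3]], [[0.2]]]   # one (features x outputs) matrix per graph
fused = multi_graph_conv(
    demand, [proximity, functional_similarity, connectivity], weights
)
```

Keeping a separate weight matrix per graph lets each relation type contribute
differently to the fused region embedding.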

User behavior modeling. Modeling human behavior is important for many real-world
applications, e.g., demographic attribute prediction, content recommendation, and
targeted advertising. Studying human behavior in the urban scenario can benefit
urban intelligence in many aspects, e.g., the economy and transportation. Here, we
introduce an example of modeling spatial-temporal user behavior with tripartite
graphs (Wang et al, 2020a).

Taking urban users’ online browsing behavior as an example, spatial-temporal user
behavior can be defined on a set of users U, a set of sessions S, a set of items
V, and a set of locations L. Each user’s behavior log can be represented by a set
of session-location tuples, and each session contains multiple item-timestamp
tuples. A user’s spatial-temporal behavior can then be captured via a tripartite
graph, as illustrated in Fig. 27.5. The nodes of this tripartite graph are the
user’s sessions S, locations L, and items V. The edges are session-item edges and
session-location edges.
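
The tripartite structure described above maps naturally onto a small data
structure. The Python sketch below builds the three node sets and the two edge
types from one user's behavior log; the log layout (a list of session, location,
and item-timestamp tuples), the sample values, and the function name are
hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def build_behavior_graph(behavior_log):
    """Build one user's tripartite graph.

    behavior_log: list of (session_id, location, [(item, timestamp), ...])
    Returns the node sets, the session-item edges (ordered by timestamp),
    and the session-location edges.
    """
    session_item = defaultdict(list)
    session_location = {}
    for session_id, location, clicks in behavior_log:
        session_location[session_id] = location
        # Keep the items of a session in browsing order.
        for item, timestamp in sorted(clicks, key=lambda click: click[1]):
            session_item[session_id].append((item, timestamp))

    nodes = {
        "sessions": set(session_location),
        "locations": set(session_location.values()),
        "items": {item for clicks in session_item.values() for item, _ in clicks},
    }
    return nodes, dict(session_item), session_location


# Toy log for one user: two sessions at two locations.
log = [
    ("s1", "home", [("shoes", 2), ("hat", 1)]),
    ("s2", "office", [("hat", 5)]),
]
nodes, session_item_edges, session_location_edges = build_behavior_graph(log)
```

Sessions sit in the middle of the graph: every edge touches a session node, which
is why session embeddings are the natural starting point for aggregation.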


Figure 27.5: Spatial-temporal user behavior graph (item nodes, session nodes, and
location nodes)


To extract the user representation from each user’s spatial-temporal behavior
graph, a GNN can be employed. The idea is to extract session embeddings from the
items within each session, where an RNN can be employed to aggregate the
information of the items. The session embeddings are then further aggregated into
temporal embeddings over different time spans, e.g., day or week. In addition, the
session embeddings and locations are combined to produce the spatial embeddings.
Finally, the spatial and temporal embeddings are fused into one embedding that
represents the user’s behavior. For more details, please refer to (Wang et al,
2020a).


27.1.7 Future Directions


It is inspiring that GNNs have achieved significant results in urban
intelligence. For future research, we envision several potential directions, as
follows.

Interpretability of GNN models for urban intelligence. The applications of GNNs
to urban intelligence are closely related to real-world problems. Besides
improving the performance of GNN models, it is necessary to enhance their
interpretability. For example, in traffic flow prediction, it is important to
identify hidden factors (e.g., the structure of the road network) that can affect
the traffic flow. These hidden factors may also help urban planners better design
road networks to balance the traffic flow.

Recent advances in interpretable AI and machine learning research have led to the
development of numerous intrinsically or post-hoc interpretable graph neural
network models (Huang et al, 2020c). However, few of them are designed for GNNs on
urban problems. Designing interpretable urban GNNs is non-trivial due to the
unique properties of urban big data. For example, urban data are usually
heterogeneous, i.e., the interpretation of the learned relationships between the
input features and target variables varies over space: the risk factors for
traffic accidents may shift when moving from a densely populated area to a
non-residential area. Also, the interpretation models of a GNN at nearby locations
(e.g., neighboring nodes) share similarities due to the auto-correlation of
spatial data (Pan et al, 2020b). These factors should be considered when designing
interpretable urban GNNs.

New applications for GNNs in urban intelligence. As introduced above, GNNs have
demonstrated their effectiveness and efficiency in many application domains of
urban intelligence, e.g., transportation, environment, energy, safety, and human
behavior. Further potential applications of GNNs in urban scenarios include
improving urban power (electricity) supply, contact tracing for patients with
infectious diseases (e.g., COVID-19), and modeling responses to complex
environmental and climate events (e.g., floods and hurricanes).


Editor’s Notes: Urban intelligence covers a wide range of macro-scale
physical networks such as transportation networks and power grids. They
are typical cases of spatial networks, i.e., networks whose nodes and
edges are embedded in space, possibly under spatial constraints (e.g.,
planarity). So it is not a surprise that urban intelligence could largely
benefit from deep learning techniques for spatial data and network data.
Different from most of the application domains introduced in Chapters 19-27,
many subareas of urban intelligence already have well-designed computational
models, so it is important to explore how deep graph learning techniques can
contribute to, and compensate for the weaknesses of, the existing
strategies.